Normalized information-based divergences




By J.-F. Coeurjolly, R. Drouilhet and J.-F. Robineau 



University of Grenoble 2, France

February 2, 2008



Abstract 



This paper is devoted to the mathematical study of some divergences based on
the mutual information that are well-suited to categorical random vectors. These
divergences are generalizations of the "entropy distance" and the "information distance".
Their main characteristic is that they combine a complexity term and the mutual
information. We then introduce the notion of (normalized) information-based
divergence, propose several examples and discuss their mathematical
properties, in particular in some prediction framework.

1 Introduction

Shannon information theory, usually just called information theory, was introduced in
1948 by Shannon (1948). The theory aims at providing a means for measuring information.
More precisely, the amount of information in an object may be measured by its entropy
and may be interpreted as the length of the description of the object under some encoding
scheme. In the Shannon approach, the objects to be encoded are assumed to be outcomes of
a known source. Shannon theory also provides the notion of mutual information (related
to two objects), which plays a central role in many applications, from lossy compression
to machine learning methods.

Several authors noticed that it would be useful to modify the mutual information
such that the resulting quantity becomes a metric in a strict sense. As a first example,
Crutchfield (1990) and Hillman (1998) introduced the entropy distance, defined as the sum
of the conditional entropies. Other interesting measures are the information distance of
Bennett et al. (1998) and its normalized version, named similarity metric, introduced by
Li et al. (2004) in the context of the Kolmogorov complexity theory. More precisely,
the information distance is defined as the maximum of the conditional Kolmogorov 
complexities. The similarity metric is universal in the sense defined by the authors 
and is not computable, since it is based on the uncomputable notion of Kolmogorov 
complexity. 

Recent papers have demonstrated useful applications of suitable versions of the
similarity metric in areas as diverse as genomics, virology, languages, literature, music,
handwritten digits and astronomy, see Cilibrasi and Vitanyi (2005b). To apply the metric
to real data, the authors have to replace the noncomputable Kolmogorov
complexity by an approximation using standard real-world compressors: GenCompress
for genomics, Li et al. (2001), the Normalized Compression Distance (NCD) for music
clustering, Cilibrasi et al. (2003), and the Normalized Google Distance (NGD) for automatic
meaning discovery, Cilibrasi and Vitanyi (2005a), are examples of effective compressors.
To include the information distance and the similarity metric in a framework based on
information theory concepts, we make use of the principle that expected Kolmogorov
complexity equals Shannon entropy; the interested reader can refer to Grünwald and
Vitanyi (2004), Leung-Yan-Cheong and Cover (1978) and Hammer et al. (2000) for more
details. Consequently, the entropy and information distances are both expressed in terms
of conditional entropies: the first one as their sum and the second one as their maximum.
Kraskov et al. (2003) give a proof of the triangular inequality for these distances and
their respective normalized versions.

In the supervised learning framework, some method for selecting covariables among
a large number is required when the data size is assumed to be too small, with respect
to the number of available covariables, to apply any existing discriminant analysis
method. Such a problem has been widely treated, see Liu and Motoda (1998). The approach
undertaken by Robineau (2004) is mainly based on three kinds of methodological tools.
The first one is a supervised quantization method consisting in the simplification of
covariables that are too complex (in particular with too large a number of possible
values). Indeed, our main belief is that, in order to predict the class variable, which
generally represents a small number of categories of data, each possibly predictive
covariable must not be too complex. The second one is a more usual step-by-step selection
method combining the simplified covariables together in order to detect clusters of data
of the same class. The last one is aimed at detecting redundancy among the set of
covariables. These three tasks may be realized using the entropy or information distances
(or their normalized versions). Let us emphasize some properties that help to understand
the usefulness of these criteria in such a context. The entropy and information distances
$D^E$ and $D^I$ can be rewritten as the difference between some term, respectively the joint
entropy and the maximum of the marginal entropies, and the mutual information. The
first term may be interpreted as a complexity term. Moreover, both are independence
measures with the particular property of being minimal (in fact equal to 0) when the random
vectors share exactly the same information. Robineau (2004) then proposes to extend
the definition of the entropy and information distances by introducing the notion of
information-based divergence $\Delta_{X,Y}$ between two categorical random vectors X and Y,
defined as the difference of some complexity term $C_{X,Y}$ and the mutual information $I_{X,Y}$,
and such that $C_{X,Y}$ is an upper bound of $I_{X,Y}$ reached when X and Y share exactly
the same information. The notion of normalized information-based divergence $\delta_{X,Y}$
derives directly by dividing the associated information-based divergence $\Delta_{X,Y}$ by the
complexity term $C_{X,Y}$. The normalized versions $d^E$ and $d^I$ of $D^E$ and $D^I$ are particular
examples. Other examples are given in Robineau (2004). Among them, one is of
particular interest since its complexity term $C^S$ is the mean of the marginal entropies.
The associated (non-normalized) information-based divergence $\Delta^S$ is not so different
from $D^E$ since it corresponds to its half. Nevertheless, the expression of its complexity
term $C^S$ really differs from the complexity term $C^E$ of $D^E$ (i.e. the joint entropy). For
practical purposes, we may argue that $D^E$, $D^I$ and $\Delta^S$ are not well-suited to a prediction
framework since a small value of these distances means that both the explained and
explicative variables have a good knowledge of each other. This is due to the fact that
both conditional entropies have at least the same weight.

In this paper, this drawback is weakened by introducing a natural extension $C^{S,\alpha}$ of
the complexity term $C^S$, defined as a weighted mean (by $\alpha$ and $1-\alpha$ for some $0 \le \alpha < 1$)
of the minimum and maximum of the marginal entropies. This kind of complexity term leads
to an expected IB-divergence $\Delta^{S,\alpha}$, which is the weighted mean of the minimum and
maximum of the conditional entropies.

The paper is organized as follows. In Section 2, we recall the definitions and
main properties of the entropy and information distances (and their normalized versions).
Similarly to Granger et al. (2004), we extract the main characteristics to define some
general concept of information divergence which could be theoretically applied in a more
general setting (continuous, discrete, ...). Section 3 concentrates on categorical
(and in particular discrete) random vectors, as is usually the case in most
applications using the entropy or information distance. We give the definition of (normalized)
information-based divergence and propose several examples. We study their
mathematical properties in a general context and propose some sufficient conditions for these
divergences to satisfy a triangular-type inequality. Finally, in Section 4, we exhibit
some properties of information-based divergences in the special prediction framework.
In particular, we show that these divergences are useful to detect redundancy.

2 Normalized entropy distance and normalized information distance

Let us denote by $\mathcal{T}$ the set of categorical random vectors, that is, discrete-valued random
vectors with finite entropy. In the sequel, X, Y and Z are three elements of $\mathcal{T}$.

2.1 Some notation

We denote by $H_X$ (when it exists) the Shannon entropy of X given by

$$H_X = - \sum_{x \in \Omega_X} p_X(x) \log\big(p_X(x)\big) \quad \text{with} \quad p_X(x) = P(X = x),$$

where $\Omega_X$ is the set of possible values of X. In the same way, one can define the joint
entropy of X and Y denoted by $H_{X,Y}$, and the conditional entropy of X (resp. Y) given
Y (resp. X) denoted by $H_{X|Y}$ (resp. $H_{Y|X}$).
Finally, we denote by $I_{X,Y}$ the mutual information between the random vectors X and
Y. When these different quantities exist, the following relations hold (see e.g. Cover
and Thomas (1991)):

$$H_{X,Y} = H_X + H_{Y|X} = H_Y + H_{X|Y} \tag{1}$$

$$I_{X,Y} = H_X - H_{X|Y} = H_Y - H_{Y|X} = H_X + H_Y - H_{X,Y} \tag{2}$$
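
Relations (1) and (2) can be checked numerically on a small example. The sketch below is only an illustration: the joint table, the function names and the use of natural logarithms are assumptions, not part of the paper.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def quantities(joint):
    """joint[i][j] = P(X = x_i, Y = y_j) for a hypothetical pair (X, Y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h_xy = entropy([p for row in joint for p in row])
    # Conditional entropies computed directly from their definition.
    h_y_given_x = -sum(p * math.log(p / px[i])
                       for i, row in enumerate(joint) for p in row if p > 0)
    h_x_given_y = -sum(p * math.log(p / py[j])
                       for row in joint for j, p in enumerate(row) if p > 0)
    # Mutual information computed directly as a Kullback-Leibler divergence.
    mi = sum(p * math.log(p / (px[i] * py[j]))
             for i, row in enumerate(joint) for j, p in enumerate(row) if p > 0)
    return {"HX": entropy(px), "HY": entropy(py), "HXY": h_xy,
            "HX|Y": h_x_given_y, "HY|X": h_y_given_x, "I": mi}

q = quantities([[0.30, 0.10, 0.00], [0.05, 0.25, 0.30]])
# Relation (1): both entries should equal the joint entropy.
rel1 = (q["HX"] + q["HY|X"], q["HY"] + q["HX|Y"])
# Relation (2): all three entries should equal the mutual information.
rel2 = (q["HX"] - q["HX|Y"], q["HY"] - q["HY|X"], q["HX"] + q["HY"] - q["HXY"])
```

Because the conditional entropies and the mutual information are computed from their own definitions, agreement with (1) and (2) is a genuine check rather than a tautology.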

2.2 Definition and characteristics 

We now present some measures that overcome some drawbacks of the mutual
information. As a first generalization, several authors noticed that it would be useful to
modify the mutual information such that the resulting quantity becomes a metric in a
strict sense. Two such measures exist and are well-known in the literature. The first one,
called "entropy distance", is derived from the domain of information theory. The second
one, called "information distance", originates in works around the Kolmogorov complexity.
Both measures are defined (when they exist) for two random vectors X and Y by:



• Entropy distance:

$$D^E_{X,Y} = H_{X|Y} + H_{Y|X} \tag{3}$$

• Information distance:

$$D^I_{X,Y} = \max\big(H_{X|Y}, H_{Y|X}\big). \tag{4}$$



Both measures are indeed some modifications of the mutual information since from (1)
and (2), we have

$$D^E_{X,Y} = H_{X,Y} - I_{X,Y} \quad \text{and} \quad D^I_{X,Y} = \max(H_X, H_Y) - I_{X,Y}. \tag{5}$$

The quantities $H_{X,Y}$ and $\max(H_X, H_Y)$ are upper bounds of the mutual information
$I_{X,Y}$ that are reached when X and Y share exactly the same information. In other
words, these two measures are nonnegative and vanish if and only if $H_{X|Y} = H_{Y|X} = 0$,
expressing the fact that X (resp. Y) predicts Y (resp. X) with probability 1.
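
As an illustration of (3), (4) and (5), the following sketch computes both distances from a joint probability table. The tables and the helper name `distances` are hypothetical; entropies are in nats.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def distances(joint):
    """Return (D^E, D^I, H_XY, max(H_X, H_Y), I_XY) for a joint table."""
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    h_xy = entropy([p for row in joint for p in row])
    h_x_given_y, h_y_given_x = h_xy - h_y, h_xy - h_x   # relation (1)
    mi = h_x + h_y - h_xy                                # relation (2)
    d_e = h_x_given_y + h_y_given_x                      # definition (3)
    d_i = max(h_x_given_y, h_y_given_x)                  # definition (4)
    return d_e, d_i, h_xy, max(h_x, h_y), mi

# A dependent pair, and a pair where Y is a relabelling of X.
d_e, d_i, h_xy, h_max, mi = distances([[0.30, 0.10], [0.05, 0.55]])
d_e_same, d_i_same, _, _, _ = distances([[0.4, 0.0], [0.0, 0.6]])
```

For the second table each variable determines the other, so both distances vanish up to rounding.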
These measures satisfy

$$D^E_{X,Y} \le H_{X,Y} \quad \text{and} \quad D^I_{X,Y} \le \max(H_X, H_Y), \tag{6}$$

where the equality holds if the vectors X and Y are independent. As noticed by Kaltchenko
(2004), Li and Vitanyi (1997) argued that in bioinformatics an unnormalized distance
may not be a proper evolutionary distance measure. It would put two long and complex
sequences that differ only by a tiny fraction of the total information as dissimilar as two
short sequences that differ by the same absolute amount and are completely random
with respect to one another. To overcome this problem within the algorithmic
framework, Li and Vitanyi (1997) form two normalized versions of the distances $D^E$ and $D^I$. Their
Shannon versions have been proposed and studied by Kraskov et al. (2003).

Definition 1 When they exist, one defines the two following measures:

• Normalized entropy distance:

$$d^E_{X,Y} = \frac{H_{X|Y} + H_{Y|X}}{H_{X,Y}}$$

• Normalized information distance:

$$d^I_{X,Y} = \frac{\max\big(H_{X|Y}, H_{Y|X}\big)}{\max(H_X, H_Y)}$$

Since $H_{X,Y} = 0 \Leftrightarrow H_X = H_Y = 0 \Leftrightarrow \max(H_X, H_Y) = 0$, we set by convention $d^E_{X,Y} = 0$
(resp. $d^I_{X,Y} = 0$) when $H_X = H_Y = 0$.
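
Definition 1, including the convention for degenerate vectors, can be sketched as follows. The function name and the example tables are assumptions made for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_distances(joint):
    """Return (d^E, d^I) of Definition 1 for the joint table of (X, Y)."""
    h_x = entropy([sum(row) for row in joint])
    h_y = entropy([sum(col) for col in zip(*joint)])
    h_xy = entropy([p for row in joint for p in row])
    if h_xy == 0:
        # H_X = H_Y = 0: both distances are set to 0 by convention.
        return 0.0, 0.0
    d_e = (2 * h_xy - h_x - h_y) / h_xy                 # (H_X|Y + H_Y|X) / H_XY
    d_i = (h_xy - min(h_x, h_y)) / max(h_x, h_y)        # max(H_X|Y, H_Y|X) / max(H_X, H_Y)
    return d_e, d_i

dep = normalized_distances([[0.30, 0.10], [0.05, 0.55]])
ind = normalized_distances([[0.12, 0.28], [0.18, 0.42]])   # product of (0.4, 0.6) and (0.3, 0.7)
const = normalized_distances([[1.0]])                      # both vectors are constant
```

The independent table gives both distances equal to 1 up to rounding, which is the equality case of (6) after normalization.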

We are encouraged to define the following class of equivalence: the vectors X and
Y are said to be equivalent if X (resp. Y) predicts Y (resp. X) with probability 1 and
one will denote

$$X \sim Y \ \Leftrightarrow\ H_{X|Y} = H_{Y|X} = 0 \ \Leftrightarrow\ I_{X,Y} = H_{X,Y} = H_X = H_Y. \tag{7}$$

Due to the previous convention

From (1) and (2), one can obtain the following expressions for these two measures,
allowing some new interpretations.

Proposition 1 We have the following expressions for $d^E$ and $d^I$.

$$d^E_{X,Y} = 1 - \frac{I_{X,Y}}{H_{X,Y}} \tag{8}$$

$$d^I_{X,Y} = 1 - \frac{I_{X,Y}}{\max(H_X, H_Y)} = \max\left(\frac{H_{X|Y}}{H_X}, \frac{H_{Y|X}}{H_Y}\right) \tag{9}$$

Proposition 2 The measures $d^E$ and $d^I$ constitute two distances bounded by 1.

To our knowledge, these results have been proved by Kraskov et al. (2003). Proofs 
are very similar to proofs of Li et al. (2003) who consider the algorithmic version of these 
distances. The proof is then omitted, but in Section 3.3, we propose a result extending 
this one in the sense that we give conditions on measures that can be written as (8) 
and (9) to constitute a metric. 
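
The triangular inequality stated in Proposition 2 can be probed numerically. The sketch below draws random joint distributions of a hypothetical triple (X, Y, Z) and counts violations for $d^E$ and $d^I$; the alphabet sizes, the seed and the tolerance are arbitrary choices, and such a test illustrates the result without proving it.

```python
import itertools
import math
import random

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def h(pmf, axes):
    """Entropy of the sub-vector of (X, Y, Z) selected by axes (X=0, Y=1, Z=2)."""
    acc = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[a] for a in axes)
        acc[key] = acc.get(key, 0.0) + p
    return entropy(acc.values())

def d_e(pmf, a, b):
    h_ab = h(pmf, (a, b))
    return (2 * h_ab - h(pmf, (a,)) - h(pmf, (b,))) / h_ab

def d_i(pmf, a, b):
    h_a, h_b, h_ab = h(pmf, (a,)), h(pmf, (b,)), h(pmf, (a, b))
    return (h_ab - min(h_a, h_b)) / max(h_a, h_b)

rng = random.Random(0)
outcomes = list(itertools.product(range(2), range(3), range(2)))
violations = 0
for _ in range(200):
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    pmf = {o: w / total for o, w in zip(outcomes, weights)}
    for dist in (d_e, d_i):
        # Triangular inequality d(X, Y) <= d(X, Z) + d(Z, Y), with a rounding tolerance.
        if dist(pmf, 0, 1) > dist(pmf, 0, 2) + dist(pmf, 2, 1) + 1e-12:
            violations += 1
```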

2.3 Concept of information divergence 

We can exhibit from the previous study related to $D^E$, $D^I$, $d^E$ and $d^I$ some
characteristics useful for an attempt to define the concept of information divergence, denoted
by $\Delta$, in a more general setting. Let us first consider a similarity measure $\mathcal{I}_{X,Y}$ (not
necessarily the mutual information) that is minimal (in fact equal to 0) when X and Y are
independent, and maximal (in fact equal to $\mathcal{I}_{X,X} = \mathcal{I}_{Y,Y}$) when the distributions of X
given Y = y and Y given X = x are trivial. An information divergence $\Delta_{X,Y}$ could
satisfy the following properties:

[P1] symmetry: $\Delta_{X,Y} = \Delta_{Y,X}$.

[P2] nonnegativeness: $\Delta_{X,Y} \ge 0$.

[P3] $\Delta_{X,Y}$ is minimum (i.e. $\Delta_{X,Y} = 0$) if and only if X and Y share exactly the
same information (i.e. $\mathcal{I}_{X,Y}$ is maximal).

[P4] $\Delta_{X,Y}$ is maximum if and only if X and Y are independent (i.e. $\mathcal{I}_{X,Y} = 0$).

Other supplementary properties could be that $\Delta_{X,Y}$:

[P5] is normalized: $\Delta_{X,Y} \in [0, 1]$ and $\Delta_{X,Y} = 1$ when X and Y are independent.

[P6] satisfies a triangular inequality: $\Delta_{X,Y} \le \Delta_{X,Z} + \Delta_{Z,Y}$.

[P7] is invariant under continuous and strictly increasing transformations $\psi(\cdot)$, $\varphi(\cdot)$
of the vectors X and Y, whenever they are quantitative random vectors.

There exists a large literature on the discussion of criteria satisfying the previously
stated properties. We may cite Ullah (1996), or a recent work of Granger et al. (2004)
who propose to detect the dependence between two possibly nonlinear processes through
the Bhattacharya-Matusita-Hellinger measure of dependence given by

$$S_\rho = \frac{1}{2} \iint \left(\sqrt{f_1(x,y)} - \sqrt{f_2(x,y)}\right)^2 dx\, dy,$$

where $f_1$ (resp. $f_2$) is the joint density (resp. the product of marginal densities) of X
and Y. This measure, which has the further advantage of being applicable to continuous or
discrete variables, satisfies properties [P1]-[P7] (in fact, let us point out that [P7] is only
valid if $\varphi(\cdot) = \psi(\cdot)$).
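
For discrete variables the double integral becomes a double sum over the joint table. The sketch below is an assumed discrete analogue written for illustration; the function name and the tables are hypothetical and it is not taken from Granger et al. (2004).

```python
import math

def hellinger_dependence(joint):
    """Discrete analogue of S_rho: half the squared Hellinger-type distance between
    the joint table and the product of its marginals."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return 0.5 * sum((math.sqrt(p) - math.sqrt(px[i] * py[j])) ** 2
                     for i, row in enumerate(joint) for j, p in enumerate(row))

dependent = [[0.30, 0.10], [0.05, 0.55]]
independent = [[0.12, 0.28], [0.18, 0.42]]      # product of (0.4, 0.6) and (0.3, 0.7)
transposed = [list(col) for col in zip(*dependent)]

s_dep = hellinger_dependence(dependent)
s_ind = hellinger_dependence(independent)
s_sym = hellinger_dependence(transposed)        # roles of X and Y exchanged
```

The measure vanishes for the product table, is positive for the dependent one, and is unchanged when X and Y are exchanged.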

In some framework where the purpose is to predict some reference variable, one may
find it interesting to work with a divergence $\Delta_{X,Y}$ which combines the minimization of a
nonnegative complexity term denoted by $\mathcal{C}_{X,Y}$ and the maximization of a nonnegative
information term $\mathcal{I}_{X,Y}$. The quantity $\mathcal{C}_{X,Y}$ is called a complexity term since it is
assumed to be expressed as a function of $\mathcal{H}_X$, $\mathcal{H}_Y$ and $\mathcal{H}_{X,Y}$ measuring in some way
respectively the complexity of the vectors X, Y and (X, Y). In other words, we may expect
that an information divergence $\Delta_{X,Y}$ could also satisfy the following properties:

[P8] When $X_1$ and $X_2$ have the same complexity (in the sense that $\mathcal{C}_{Y,X_1} = \mathcal{C}_{Y,X_2}$),
$\Delta_{Y,X_1} \le \Delta_{Y,X_2}$ whenever $X_1$ has a better knowledge about Y than $X_2$
(i.e. $\mathcal{I}_{Y,X_1} \ge \mathcal{I}_{Y,X_2}$).

[P9] When $X_1$ and $X_2$ have the same knowledge about Y (i.e. $\mathcal{I}_{Y,X_1} = \mathcal{I}_{Y,X_2}$),
$\Delta_{Y,X_1} \le \Delta_{Y,X_2}$ whenever $X_1$ is simpler than $X_2$ in the sense that $\mathcal{C}_{Y,X_1} \le \mathcal{C}_{Y,X_2}$.
Moreover, in this particular situation the fact that

[P10] $\mathcal{C}_{Y,X_1} \le \mathcal{C}_{Y,X_2}$ must be equivalent to $\mathcal{H}_{X_1} \le \mathcal{H}_{X_2}$.

[P11] When $X_1$ and $X_2$ share almost exactly the same information (i.e. $\mathcal{I}_{X_1,X_2}$
is almost maximal and $\Delta_{X_1,X_2} \simeq 0$) then the difference between the divergences
$\Delta_{Y,X_1}$ and $\Delta_{Y,X_2}$ is almost zero (i.e. $\Delta_{Y,X_1} \simeq \Delta_{Y,X_2}$).

A class of candidates that satisfy the previous statements [P8] and [P9] could be of the
form:

$$\Delta_{X,Y} = \frac{\mathcal{C}_{X,Y} - \mathcal{I}_{X,Y}}{\mathcal{W}_{X,Y}} \tag{11}$$

where $\mathcal{W}_{X,Y}$ is a positive term. When $\mathcal{W}_{X,Y} = \mathcal{C}_{X,Y}$ we obtain a normalized
information divergence. The properties [P2]-[P3] and the form (11) imply that $\mathcal{C}_{X,Y}$ is an
upper bound of $\mathcal{I}_{X,Y}$ reached when X and Y share exactly the same information.

In the rest of this paper we concentrate on criteria described by (11) that are
in addition well-suited to categorical random variables (and in particular discrete random
variables). In such a framework, we shall only describe some entropy-based criteria (i.e.
$\mathcal{H}_X = H_X$), and so the information term will be set to the mutual information $I_{X,Y}$.

3 Information-based divergences and their normalized versions

3.1 Definition and examples 

Definition 2 Two criteria $\Delta$ and $\delta$ are respectively called an information-based
divergence and a normalized information-based divergence (in short IB-divergence and
NIB-divergence) if they can respectively be written

$$\Delta_{X,Y} = C_{X,Y} - I_{X,Y} \tag{12}$$

$$\delta_{X,Y} = \frac{C_{X,Y} - I_{X,Y}}{C_{X,Y}} = 1 - \frac{I_{X,Y}}{C_{X,Y}} = \frac{\Delta_{X,Y}}{C_{X,Y}} \tag{13}$$

where the term $C_{X,Y}$ constitutes a complexity term satisfying

(i) $C_{X,Y} = C_{Y,X}$.

(ii) $I_{X,Y} \le C_{X,Y}$ and this bound is achieved if and only if the random vectors X and
Y are equivalent, i.e. if and only if $X \sim Y$.

We set by convention $\delta_{X,Y} = 0$ when $C_{X,Y} = I_{X,Y} = 0$.

This definition implies automatically that an IB-divergence $\Delta_{X,Y}$ (resp. a
NIB-divergence $\delta_{X,Y}$) satisfies properties [P1]-[P4] (resp. [P1]-[P5]). In the rest of the
paper, the term $C_{X,Y}$ is expressed as

$$C_{X,Y} = f_C\big(H_{X|Y}, H_{Y|X}, I_{X,Y}\big), \tag{14}$$

where $f_C(\cdot, \cdot, \cdot)$ is a nonnegative function. Under such an expression of $C_{X,Y}$, the property
[P7] is ensured since the conditional entropies and the mutual information depend only
on the joint probability distribution of the categorical random vectors X and Y.

From now on, we propose a series of examples for which we adopt the following
convention: an IB-divergence (resp. a NIB-divergence) satisfying the triangular inequality
is denoted $D$ (resp. $d$) rather than $\Delta$ (resp. $\delta$). Moreover, each example will be
distinguished by some additional letter, in the same manner as $D^E$ and
$D^I$ (resp. $d^E$ and $d^I$), which clearly constitute IB-divergences (resp. NIB-divergences).

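In Robineau (2004), we investigate two new entropic criteria naturally
expressed by

$$\delta^D_{X,Y} = \frac{1}{2}\left(\frac{H_{X|Y}}{H_X} + \frac{H_{Y|X}}{H_Y}\right) \quad \text{and} \quad \delta^S_{X,Y} = \frac{H_{X|Y} + H_{Y|X}}{H_X + H_Y},$$

which can be rewritten as NIB-divergences:

$$\delta^D_{X,Y} = 1 - \frac{I_{X,Y}}{C^D_{X,Y}} \quad \text{with} \quad C^D_{X,Y} = \left(\frac{1}{2}\left(\frac{1}{H_X} + \frac{1}{H_Y}\right)\right)^{-1} \tag{15}$$

$$\delta^S_{X,Y} = 1 - \frac{I_{X,Y}}{C^S_{X,Y}} \quad \text{with} \quad C^S_{X,Y} = \frac{1}{2}\left(H_X + H_Y\right). \tag{16}$$

Their non-normalized versions are expressed as $\Delta^D_{X,Y} = C^D_{X,Y} - I_{X,Y}$ and
$\Delta^S_{X,Y} = C^S_{X,Y} - I_{X,Y}$.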
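
The equivalence between the direct expressions of $\delta^D$ and $\delta^S$ and their NIB-divergence forms (15) and (16) can be verified on an example. The joint table and the variable names below are hypothetical.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

joint = [[0.30, 0.10, 0.00], [0.05, 0.25, 0.30]]   # hypothetical joint table of (X, Y)
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])
h_x_given_y, h_y_given_x = h_xy - h_y, h_xy - h_x   # relation (1)
mi = h_x + h_y - h_xy                                # relation (2)

# Direct expressions of the two criteria.
delta_d = 0.5 * (h_x_given_y / h_x + h_y_given_x / h_y)
delta_s = (h_x_given_y + h_y_given_x) / (h_x + h_y)

# NIB-divergence form 1 - I / C with the complexity terms of (15) and (16).
c_d = 1 / (0.5 * (1 / h_x + 1 / h_y))   # harmonic mean of the marginal entropies
c_s = 0.5 * (h_x + h_y)                 # arithmetic mean of the marginal entropies
delta_d_nib = 1 - mi / c_d
delta_s_nib = 1 - mi / c_s
```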
In this paper, we are interested in a large family of IB-divergences or NIB-divergences
with complexity terms of the form:

$$C^{g,\alpha}_{X,Y} = g^{-1}\big(\alpha \times g(m_{X,Y}) + (1 - \alpha) \times g(M_{X,Y})\big) \tag{17}$$

with $m_{X,Y} = \min(H_X, H_Y)$ and $M_{X,Y} = \max(H_X, H_Y)$, and where $0 \le \alpha < 1$ and
$g(\cdot)$ is any monotone function on $\mathbb{R}_+$. When it is not ambiguous we set $m = m_{X,Y}$ and
$M = M_{X,Y}$. To be convinced that IB-divergences and NIB-divergences with complexity
terms of the form (17) satisfy (ii) of Definition 2, let us notice that

$$I_{X,Y} = g^{-1}\big(\alpha g(I_{X,Y}) + (1 - \alpha) g(I_{X,Y})\big) \le g^{-1}\big(\alpha g(m) + (1 - \alpha) g(M)\big).$$

When $\alpha = 0$, the complexity term $C^{g,\alpha}$ corresponds to $C^I$. When $\alpha = 1$, the complexity
term defined by $\min(H_X, H_Y)$ and denoted by $C^{\min}$ does not satisfy (ii) of Definition 2
and hence [P3]. The associated $\Delta^{\min}$ (resp. $\delta^{\min}$) is not an IB-divergence (resp. a
NIB-divergence).

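We now pay particular attention to the complexity terms $C^{D,\alpha}$, $C^{S,\alpha}$, $C^{R,\alpha}$
and $C^{P,\alpha}$ of the form (17) respectively with $g^D(\cdot) = 1/\cdot$, $g^S(\cdot) = \cdot$, $g^R(\cdot) = \sqrt{\cdot}$ and
$g^P(\cdot) = \log(\cdot)$:

$$C^{D,\alpha}_{X,Y} = \left(\frac{\alpha}{\min(H_X, H_Y)} + \frac{1 - \alpha}{\max(H_X, H_Y)}\right)^{-1} \tag{18}$$

$$C^{S,\alpha}_{X,Y} = \alpha \min(H_X, H_Y) + (1 - \alpha) \max(H_X, H_Y) \tag{19}$$

$$C^{R,\alpha}_{X,Y} = \left(\alpha \sqrt{\min(H_X, H_Y)} + (1 - \alpha) \sqrt{\max(H_X, H_Y)}\right)^2 \tag{20}$$

$$C^{P,\alpha}_{X,Y} = \min(H_X, H_Y)^{\alpha} \max(H_X, H_Y)^{1-\alpha}. \tag{21}$$

The previous measures $\Delta^S$, $\delta^S$, $\Delta^D$ and $\delta^D$ are particular examples of such a family
since the value $\alpha = \frac{1}{2}$ leads to $C^{g,1/2}_{X,Y} = g^{-1}\big(\frac{1}{2} g(H_X) + \frac{1}{2} g(H_Y)\big)$. When $\alpha = \frac{1}{2}$, $\Delta^{\bullet,\alpha}$
and $\delta^{\bullet,\alpha}$ will be simply denoted by $\Delta^{\bullet}$ and $\delta^{\bullet}$, where $\bullet$ stands for S, R, P and D.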
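
The family (17) and its four main members can be sketched with a single generic function. The names `complexity` and `FAMILY` and the numerical values of the entropies are assumptions made for the example.

```python
import math

def complexity(g, g_inv, alpha, h_x, h_y):
    """Generic complexity term of the form (17): a g-mean of min and max entropies."""
    m, big_m = min(h_x, h_y), max(h_x, h_y)
    return g_inv(alpha * g(m) + (1 - alpha) * g(big_m))

# Pairs (g, g^{-1}) for the four examples D, S, R and P.
FAMILY = {
    "D": (lambda t: 1 / t, lambda t: 1 / t),
    "S": (lambda t: t, lambda t: t),
    "R": (math.sqrt, lambda t: t * t),
    "P": (math.log, math.exp),
}

alpha, h_x, h_y = 0.3, 0.9, 1.6          # hypothetical weight and marginal entropies
m, big_m = min(h_x, h_y), max(h_x, h_y)
generic = {k: complexity(g, g_inv, alpha, h_x, h_y) for k, (g, g_inv) in FAMILY.items()}

# Closed forms: harmonic, arithmetic, square-root and geometric weighted means.
closed = {
    "D": 1 / (alpha / m + (1 - alpha) / big_m),
    "S": alpha * m + (1 - alpha) * big_m,
    "R": (alpha * math.sqrt(m) + (1 - alpha) * math.sqrt(big_m)) ** 2,
    "P": m ** alpha * big_m ** (1 - alpha),
}
```

Every member lies between the minimum and the maximum of the marginal entropies, and setting `alpha = 0` returns the maximum, that is, the complexity term of the information distance.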

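Let us first comment on the particular expressions of the divergences $\Delta^{S,\alpha}$ and $\delta^{D,\alpha}$
associated to $C^{S,\alpha}$ and $C^{D,\alpha}$, given by:

$$\Delta^{S,\alpha}_{X,Y} = \alpha \min\big(H_{X|Y}, H_{Y|X}\big) + (1 - \alpha) \max\big(H_{X|Y}, H_{Y|X}\big) = \alpha \Delta^{\min}_{X,Y} + (1 - \alpha) D^I_{X,Y}$$

$$\delta^{D,\alpha}_{X,Y} = \alpha \min\left(\frac{H_{X|Y}}{H_X}, \frac{H_{Y|X}}{H_Y}\right) + (1 - \alpha) \max\left(\frac{H_{X|Y}}{H_X}, \frac{H_{Y|X}}{H_Y}\right) = \alpha \delta^{\min}_{X,Y} + (1 - \alpha) d^I_{X,Y}$$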
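
Both convex-combination identities follow from (1) and (2) and can be checked numerically. The sketch below uses a hypothetical joint table and an arbitrary weight.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

joint = [[0.30, 0.10, 0.00], [0.05, 0.25, 0.30]]   # hypothetical joint table of (X, Y)
alpha = 0.3                                         # hypothetical weight
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])
mi = h_x + h_y - h_xy
m, big_m = min(h_x, h_y), max(h_x, h_y)

# Divergences obtained from the complexity terms C^{S,alpha} and C^{D,alpha}.
ib_s = (alpha * m + (1 - alpha) * big_m) - mi              # Delta^{S,alpha} = C - I
nib_d = 1 - mi * (alpha / m + (1 - alpha) / big_m)         # delta^{D,alpha} = 1 - I / C

# The same divergences written as convex combinations of prediction terms.
cond = [h_xy - h_y, h_xy - h_x]                            # H_{X|Y}, H_{Y|X}
ratios = [cond[0] / h_x, cond[1] / h_y]
mix_s = alpha * min(cond) + (1 - alpha) * max(cond)
mix_d = alpha * min(ratios) + (1 - alpha) * max(ratios)
```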
Clearly, the previous representation of $\Delta^{S,\alpha}$ (resp. $\delta^{D,\alpha}$) as a convex combination
of $\Delta^{\min}$ and $D^I$ (resp. $\delta^{\min}$ and $d^I$) introduces a degree of freedom that
could be useful for practical purposes in a prediction framework where Y could represent
some class variable. According to the parameter $\alpha$, one may choose to take into account
either one or both of the prediction terms $H_{X|Y}$ and $H_{Y|X}$ (resp. $\frac{H_{X|Y}}{H_X}$ and $\frac{H_{Y|X}}{H_Y}$).
This possibility to introduce a non uniform mixing of the entropic contributions in the 
expression of the complexity terms seems to be not feasible by a direct adaptation of 

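Remark 1 By choosing $g(\cdot) = (\cdot)^{\gamma}$ for some $\gamma > 0$, the complexity term is given by
$C^{\gamma,\alpha}_{X,Y} = \big\| \big(\alpha^{1/\gamma} m, (1 - \alpha)^{1/\gamma} M\big) \big\|_{\gamma}$, where $\|x\|_{\gamma} = \big(\sum_{i=1}^{2} |x_i|^{\gamma}\big)^{1/\gamma}$ denotes the norm of
some vector x of length 2. Note that for any $0 < \alpha < 1$, we have

$$(\alpha_\wedge)^{1/\gamma}\, \|(H_X, H_Y)\|_{\gamma} \le C^{\gamma,\alpha}_{X,Y} \le (\alpha_\vee)^{1/\gamma}\, \|(H_X, H_Y)\|_{\gamma}$$

with $\alpha_\wedge = \min(\alpha, 1 - \alpha)$ and $\alpha_\vee = \max(\alpha, 1 - \alpha)$. When $\gamma$ goes to infinity, $C^{\gamma,\alpha}_{X,Y}$ converges
towards $C^I_{X,Y}$.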
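
The two-sided bound and the limit in Remark 1 can be illustrated as follows. The entropies, the weight and the chosen exponents are hypothetical, and the tolerance used for the limit is only indicative.

```python
def c_gamma(alpha, gamma, m, big_m):
    """Complexity term (17) with g(t) = t ** gamma."""
    return (alpha * m ** gamma + (1 - alpha) * big_m ** gamma) ** (1 / gamma)

def norm(vec, gamma):
    """The gamma-norm of a vector (a quasi-norm when gamma < 1)."""
    return sum(abs(x) ** gamma for x in vec) ** (1 / gamma)

alpha, m, big_m = 0.3, 0.9, 1.6          # hypothetical weight, min and max entropies
lo, hi = min(alpha, 1 - alpha), max(alpha, 1 - alpha)

bounds_hold = all(
    lo ** (1 / g) * norm((m, big_m), g) <= c_gamma(alpha, g, m, big_m) + 1e-12
    and c_gamma(alpha, g, m, big_m) <= hi ** (1 / g) * norm((m, big_m), g) + 1e-12
    for g in (0.5, 1.0, 2.0, 5.0, 20.0)
)

# For a large exponent the term approaches max(H_X, H_Y), i.e. C^I.
c_large = c_gamma(alpha, 200.0, m, big_m)
```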

Remark 2 The complexity term $C^{g,\alpha}$ is invariant under linear transformation of g. In
particular, $g$ and $-g$ provide the same complexity term. Consequently, without loss of
generality we could restrict g to be an increasing function.

Let us now propose a result that orders the different examples considered in this
paper. First, a preliminary result is given.

Lemma 3 Let $C^{g_1,\alpha}$ and $C^{g_2,\alpha}$ be two complexity terms of the form (17) with functions $g_1$
and $g_2$. Assume either that the function $g_1 \circ g_2^{-1}$ is concave or that the function $g_2 \circ g_1^{-1}$
is convex, then $C^{g_1,\alpha} \le C^{g_2,\alpha}$.

Proof. By rewriting $g_1 = (g_1 \circ g_2^{-1}) \circ g_2$ when $g_1 \circ g_2^{-1}$ is concave and $g_2 = (g_2 \circ g_1^{-1}) \circ g_1$
when $g_2 \circ g_1^{-1}$ is convex, one may assert

$$g_1^{-1}\big(\alpha g_1(m) + (1 - \alpha) g_1(M)\big) \le
\begin{cases}
g_2^{-1}\big(\alpha\, (g_2 \circ g_1^{-1}) \circ g_1(m) + (1 - \alpha)\, (g_2 \circ g_1^{-1}) \circ g_1(M)\big) \\
g_1^{-1}\big(g_1 \circ g_2^{-1}\big(\alpha g_2(m) + (1 - \alpha) g_2(M)\big)\big)
\end{cases}
\le g_2^{-1}\big(\alpha g_2(m) + (1 - \alpha) g_2(M)\big)$$

where $m = \min(H_X, H_Y)$ and $M = \max(H_X, H_Y)$. ■
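Proposition 4 For any $\Delta^{(1)}$, $\Delta^{(2)}$ IB-divergences or any $\delta^{(1)}$, $\delta^{(2)}$ NIB-divergences with
respective complexity terms $C^{(1)}$ and $C^{(2)}$, the following equivalence holds:

$$C^{(1)}_{X,Y} \le C^{(2)}_{X,Y} \ \Leftrightarrow\ \Delta^{(1)}_{X,Y} \le \Delta^{(2)}_{X,Y} \ \Leftrightarrow\ \delta^{(1)}_{X,Y} \le \delta^{(2)}_{X,Y}. \tag{22}$$

Since, for any $0 \le \alpha \le \alpha' < 1$,

$$C^{g,\alpha}_{X,Y} \le C^{I}_{X,Y} \quad \text{and} \quad C^{g,\alpha'}_{X,Y} \le C^{g,\alpha}_{X,Y}, \tag{23}$$

the associated IB-divergences and NIB-divergences are then ordered according to
equation (22). Furthermore, a similar result holds for the main examples of this paper since

$$C^{D,\alpha}_{X,Y} \le C^{P,\alpha}_{X,Y} \le C^{R,\alpha}_{X,Y} \le C^{S,\alpha}_{X,Y} \le C^{I}_{X,Y} \le C^{E}_{X,Y}. \tag{24}$$

Proof. Equation (22) is direct. The left-hand side of (23) comes from

$$C^{g,\alpha}_{X,Y} = g^{-1}\big(\alpha g(\min(H_X, H_Y)) + (1 - \alpha) g(\max(H_X, H_Y))\big) \le g^{-1}\big(g(\max(H_X, H_Y))\big) = C^{I}_{X,Y},$$

and the right-hand side is direct. Since $g^P \circ (g^D)^{-1}(\cdot) = -\log(\cdot)$, $g^R \circ (g^P)^{-1}(\cdot) = \exp(\cdot/2)$
and $g^S \circ (g^R)^{-1}(\cdot) = (\cdot)^2$ are convex functions, (24) is a direct consequence of Lemma 3.
■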
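
The chain (24) is the classical ordering of weighted harmonic, geometric, square-root and arithmetic means, followed by the maximum and the joint entropy. A minimal numerical check, on a hypothetical joint table and for a few arbitrary weights, could read:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def complexities(h_x, h_y, h_xy, alpha):
    """Complexity terms in the order of (24): D, P, R, S, I, E."""
    m, big_m = min(h_x, h_y), max(h_x, h_y)
    return [
        1 / (alpha / m + (1 - alpha) / big_m),                          # C^{D,alpha}
        m ** alpha * big_m ** (1 - alpha),                              # C^{P,alpha}
        (alpha * math.sqrt(m) + (1 - alpha) * math.sqrt(big_m)) ** 2,   # C^{R,alpha}
        alpha * m + (1 - alpha) * big_m,                                # C^{S,alpha}
        big_m,                                                          # C^I
        h_xy,                                                           # C^E
    ]

joint = [[0.30, 0.10, 0.00], [0.05, 0.25, 0.30]]   # hypothetical joint table of (X, Y)
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])

ordered = all(
    a <= b + 1e-12
    for alpha in (0.1, 0.5, 0.9)
    for a, b in zip(complexities(h_x, h_y, h_xy, alpha),
                    complexities(h_x, h_y, h_xy, alpha)[1:])
)
```

By (22), the same ordering then holds for the associated IB-divergences and NIB-divergences.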

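Remark 3 By Lemma 3 applied with $g_1 = g$ and $g_2 = g^S$, if $g^{-1}(\cdot)$ is a convex
function (for instance when $g$ is increasing and concave, as for $g^R$ and $g^P$, or decreasing
and convex, as for $g^D$), the following inequality holds

$$C^{g,\alpha}_{X,Y} \le \alpha m + (1 - \alpha) M = C^{S,\alpha}_{X,Y},$$

which means that any $\Delta^{g,\alpha}$ (resp. $\delta^{g,\alpha}$) satisfying the previous assumption is upper
bounded by $\Delta^{S,\alpha}$ (resp. $\delta^{S,\alpha}$).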

The following proposition gives a larger class of examples of IB-divergences and
NIB-divergences.

Proposition 5 Let $(\alpha^{(j)})_{j=1,\ldots,J}$ be some vector of probability weights for some $J \ge 1$.
(i) Let $\delta^{(1)}, \ldots, \delta^{(J)}$ be J NIB-divergences, then the measure defined by

$$\delta_{X,Y} = \sum_{j=1}^{J} \alpha^{(j)} \delta^{(j)}_{X,Y} \tag{25}$$

is a NIB-divergence with complexity term given by

$$C_{X,Y} = \left(\sum_{j=1}^{J} \frac{\alpha^{(j)}}{C^{(j)}_{X,Y}}\right)^{-1}. \tag{26}$$

(ii) Let $\Delta^{(1)}, \ldots, \Delta^{(J)}$ be J IB-divergences and $\delta^{(1)}, \ldots, \delta^{(J)}$ be J NIB-divergences with
complexity terms $C^{(1)}_{X,Y}, \ldots, C^{(J)}_{X,Y}$, then the measures defined by

$$\Delta_{X,Y} = C_{X,Y} - I_{X,Y} \quad \text{and} \quad \delta_{X,Y} = 1 - \frac{I_{X,Y}}{C_{X,Y}}, \quad \text{with} \quad C_{X,Y} = \sum_{j=1}^{J} \alpha^{(j)} C^{(j)}_{X,Y} \tag{27}$$

are also respectively an IB-divergence and a NIB-divergence.

The proof is immediate.
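
Part (i) amounts to saying that a mixture of NIB-divergences has a weighted harmonic mean as complexity term. A short check, with hypothetical weights, complexity terms and mutual information, could be:

```python
weights = [0.2, 0.5, 0.3]    # hypothetical probability weights alpha^(j)
terms = [0.9, 1.2, 1.7]      # hypothetical complexity terms C^(j), each >= mi
mi = 0.4                     # hypothetical mutual information I_{X,Y}

# Mixture (25) of the individual NIB-divergences 1 - I / C^(j).
mixture = sum(w * (1 - mi / c) for w, c in zip(weights, terms))

# Complexity term (26): weighted harmonic mean of the C^(j).
c_mix = 1 / sum(w / c for w, c in zip(weights, terms))
nib_from_c_mix = 1 - mi / c_mix

# Complexity term (27): weighted arithmetic mean of the C^(j).
c_arith = sum(w * c for w, c in zip(weights, terms))
```

Both combined complexity terms remain upper bounds of the mutual information, as required by (ii) of Definition 2.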

3.2 Around the property [P3] 

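The fact that an IB-divergence $\Delta$ (resp. a NIB-divergence $\delta$) satisfies the property [P3]
may be expressed by: $\Delta_{X,Y} = 0 \Leftrightarrow D^I_{X,Y} = 0$ (resp. $\delta_{X,Y} = 0 \Leftrightarrow d^I_{X,Y} = 0$). In
fact, [P3] should be extended to the more useful assumption: $\Delta_{X,Y}$ (or $\delta_{X,Y}$) is near
its minimum if and only if X and Y share almost the same information. This
may be translated by the following implications related to an IB-divergence $\Delta$ (resp. a
NIB-divergence $\delta$):

• for all $\gamma > 0$ there exists $\varepsilon > 0$ such that for all $(X, Y) \in T$

$$\Delta_{X,Y} < \varepsilon \Rightarrow D^I_{X,Y} < \gamma \quad (\text{resp. } \delta_{X,Y} < \varepsilon \Rightarrow d^I_{X,Y} < \gamma).$$

• for all $\varepsilon > 0$ there exists $\gamma > 0$ such that for all $(X, Y) \in T$

$$D^I_{X,Y} < \gamma \Rightarrow \Delta_{X,Y} < \varepsilon \quad (\text{resp. } d^I_{X,Y} < \gamma \Rightarrow \delta_{X,Y} < \varepsilon).$$

An IB-divergence $\Delta$ (resp. a NIB-divergence $\delta$) inherits the previous property if it
satisfies:

[P3bis($T$, $k_1$, $k_2$)] there exist some positive constants $k_1$, $k_2$ ($k_1 \le k_2$) such that for
all $(X, Y) \in T \subset \mathcal{T}^2$:

$$k_1 D^I_{X,Y} \le \Delta_{X,Y} \le k_2 D^I_{X,Y} \quad (\text{resp. } k_1 d^I_{X,Y} \le \delta_{X,Y} \le k_2 d^I_{X,Y}). \tag{28}$$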






Among our examples, we assert that $D^E$ and $d^E$ both satisfy [P3bis($\mathcal{T}^2$, 1, 2)], that is
$D^I_{X,Y} \le D^E_{X,Y} \le 2 D^I_{X,Y}$ (resp. $d^I_{X,Y} \le d^E_{X,Y} \le 2 d^I_{X,Y}$).

Most of the complexity terms considered in this paper are of the particular form (17)
where the function $g(\cdot)$ is a monotone function on $\mathbb{R}_+$. From (23), we can point out
that for such complexity terms (expressed in terms of $\Delta$ or $\delta$), the constant $k_2$ is equal
to 1. Moreover, we assert that if $\Delta$ satisfies [P3bis($T$, $k_1$, 1)] then the associated $\delta$ also
satisfies [P3bis($T$, $k_1$, 1)] since

$$k_1 d^I_{X,Y} = \frac{k_1 D^I_{X,Y}}{C^I_{X,Y}} \le \frac{\Delta_{X,Y}}{C_{X,Y}} = \delta_{X,Y}.$$

And so in the rest of this section, the results presented hereafter will be only expressed
for IB-divergences.

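Furthermore, we now consider only complexity terms of the form (17) defined through
a function $g(\cdot)$ continuously differentiable on some set $\mathcal{D}_g \subset \mathbb{R}_+$. Let us first introduce
the two following families of subsets of $\mathcal{D}_g$:

$$\mathcal{E}^g = \left\{\Theta \subset \mathcal{D}_g : 0 < \kappa^g_{\inf,\Theta} \le \kappa^g_{\sup,\Theta} < +\infty\right\} \quad \text{and} \quad \mathcal{E}^{g,\alpha} = \left\{\Theta \in \mathcal{E}^g : \alpha \frac{\kappa^g_{\sup,\Theta}}{\kappa^g_{\inf,\Theta}} < 1\right\},$$

with $\kappa^g_{\inf,\Theta} = \inf_{x \in \Theta} |g'(x)|$ and $\kappa^g_{\sup,\Theta} = \sup_{x \in \Theta} |g'(x)|$. Denote also $\alpha_\wedge = \min(\alpha, 1 - \alpha)$.

In the sequel, two results ensuring that an IB-divergence $\Delta^{g,\alpha}$ of the form (17)
satisfies [P3bis($T$, $k_1$, $k_2$)] are proposed. The difference relies upon the framework: the
constants $k_1$ and $k_2$ differ whenever the set $T$ differs.

Proposition 6 For any $\Theta \in \mathcal{E}^g$ the IB-divergence $\Delta^{g,\alpha}$ satisfies

$$\left[\text{P3bis}\left(T_\Theta,\ \alpha_\wedge \frac{\kappa^g_{\inf,\Theta}}{\kappa^g_{\sup,\Theta}},\ 1\right)\right] \quad \text{with} \quad T_\Theta = \left\{(X, Y) \in \mathcal{T}^2 : H_X, H_Y, I_{X,Y} \in \Theta\right\}.$$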

Proof. Denote $x = \min(H_{X|Y}, H_{Y|X})$, $y = \max(H_{X|Y}, H_{Y|X})$ and $z = I_{X,Y}$.
There exist $c_1$, $c_2$, $c_3$ such that

$$g^{-1}\big(\alpha g(x + z) + (1 - \alpha) g(y + z)\big) - z = \big(\alpha (g(x + z) - g(z)) + (1 - \alpha)(g(y + z) - g(z))\big)\, (g^{-1})'(c_1)$$

$$= \alpha\, |g'(c_2)|\, |(g^{-1})'(c_1)| \times x + (1 - \alpha)\, |g'(c_3)|\, |(g^{-1})'(c_1)| \times y,$$

with $c_1 \in \big[\min(g(z), \alpha g(x + z) + (1 - \alpha) g(y + z)),\ \max(g(z), \alpha g(x + z) + (1 - \alpha) g(y + z))\big]$,
$c_2 \in [z, x + z]$ and $c_3 \in [z, y + z]$. Then, we obtain for all x, y, z:

$$g^{-1}\big(\alpha g(x + z) + (1 - \alpha) g(y + z)\big) - z \ge \alpha_\wedge \frac{\kappa^g_{\inf,\Theta}}{\kappa^g_{\sup,\Theta}} \max(x, y),$$

which means that $\alpha_\wedge \frac{\kappa^g_{\inf,\Theta}}{\kappa^g_{\sup,\Theta}} D^I_{X,Y} \le \Delta^{g,\alpha}_{X,Y}$. ■



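Proposition 7 For any $\Theta \in \mathcal{E}^{g,\alpha}$ the IB-divergence $\Delta^{g,\alpha}$ satisfies
$\left[\text{P3bis}\left(T_\Theta^2,\ 1 - \alpha \frac{\kappa^g_{\sup,\Theta}}{\kappa^g_{\inf,\Theta}},\ 1\right)\right]$ with $T_\Theta = \{Z \in \mathcal{T} : H_Z \in \Theta\}$.

Proof.

$$D^I_{X,Y} - \Delta^{g,\alpha}_{X,Y} = C^I_{X,Y} - C^{g,\alpha}_{X,Y} = \alpha\, (g^{-1})'(c_1)\, \big(g(\max(H_X, H_Y)) - g(\min(H_X, H_Y))\big)$$

$$= \alpha\, |(g^{-1})'(c_1)|\, |g'(c_2)|\, |H_X - H_Y|,$$

with $c_1 \in [g(\min(H_X, H_Y)), g(\max(H_X, H_Y))]$ and $c_2 \in [\min(H_X, H_Y), \max(H_X, H_Y)]$.
Since $|H_X - H_Y| = |H_{X|Y} - H_{Y|X}| \le D^I_{X,Y}$, we then obtain

$$D^I_{X,Y} - \Delta^{g,\alpha}_{X,Y} \le \alpha \frac{\kappa^g_{\sup,\Theta}}{\kappa^g_{\inf,\Theta}} D^I_{X,Y},$$

which leads to the result. ■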

For the sake of simplicity, we write $\kappa^{\bullet}_{\inf,\Theta}$ and $\kappa^{\bullet}_{\sup,\Theta}$ instead of $\kappa^{g^{\bullet}}_{\inf,\Theta}$ and $\kappa^{g^{\bullet}}_{\sup,\Theta}$.
The following result is devoted to our different examples. We apply the two previous
propositions and present a new result obtained by taking into account the specific form
of each example.

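Proposition 8 $\Delta^{\bullet,\alpha}$ satisfies [P3bis($T_\Theta$, $k_1^{a,\bullet}$, 1)] (from Proposition 6), [P3bis($T_\Theta^2$, $k_1^{b,\bullet}$, 1)]
(from Proposition 7) and [P3bis($T_\Theta^2$, $k_1^{c,\bullet}$, 1)] where $\bullet$ stands for S, R, P and D, and

$$\begin{array}{c|c|c|c|c|c|c}
\bullet & \Theta & \kappa^{\bullet}_{\inf,\Theta} & \kappa^{\bullet}_{\sup,\Theta} & k_1^{a,\bullet} = \alpha_\wedge \dfrac{\kappa^{\bullet}_{\inf,\Theta}}{\kappa^{\bullet}_{\sup,\Theta}} & k_1^{b,\bullet} = 1 - \alpha \dfrac{\kappa^{\bullet}_{\sup,\Theta}}{\kappa^{\bullet}_{\inf,\Theta}} & k_1^{c,\bullet} \\ \hline
S & \mathbb{R}_+ & 1 & 1 & \alpha_\wedge & 1 - \alpha & - \\
R & [c_1, c_2] & \dfrac{1}{2\sqrt{c_2}} & \dfrac{1}{2\sqrt{c_1}} & \alpha_\wedge\, \rho^{-1/2} & 1 - \alpha \sqrt{\rho} \ \ (\text{if } \rho < \alpha^{-2}) & (1 - \alpha)\left(1 - \dfrac{\alpha}{(1 + \rho^{-1/2})^2}\right) \\
R & \mathbb{R}_+ & - & - & - & - & (1 - \alpha)^2 \\
P & [c_1, c_2] & \dfrac{1}{c_2} & \dfrac{1}{c_1} & \alpha_\wedge\, \rho^{-1} & 1 - \alpha \rho \ \ (\text{if } \rho < \alpha^{-1}) & 1 - \dfrac{1 - \rho^{-\alpha}}{1 - \rho^{-1}} \\
D & [c_1, c_2] & \dfrac{1}{c_2^2} & \dfrac{1}{c_1^2} & \alpha_\wedge\, \rho^{-2} & 1 - \alpha \rho^2 \ \ (\text{if } \rho < \alpha^{-1/2}) & \dfrac{1}{1 + \frac{\alpha}{1 - \alpha} \rho}
\end{array}$$

with $0 < c_1 < c_2 < +\infty$, $\rho = \frac{c_2}{c_1}$.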

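Proof. The computations of $k_1^{a,\bullet}$ and $k_1^{b,\bullet}$ derive from Propositions 6 and 7. Hence,
let us concentrate only on $k_1^{c,\bullet}$ for the complexity terms $C^{R,\alpha}$, $C^{P,\alpha}$ and $C^{D,\alpha}$. Let us
denote $m = \min(H_X, H_Y)$ and $M = \max(H_X, H_Y)$, and recall that
$M - m = |H_{X|Y} - H_{Y|X}| \le D^I_{X,Y} \le M$.

• Complexity term $C^{R,\alpha}$:

$$D^I_{X,Y} - \Delta^{R,\alpha}_{X,Y} = \alpha (1 - \alpha) \big(\sqrt{M} - \sqrt{m}\big)^2 + \alpha (M - m) = \alpha (1 - \alpha) \frac{(M - m)^2}{\big(\sqrt{M} + \sqrt{m}\big)^2} + \alpha (M - m).$$

And so,

$$\Delta^{R,\alpha}_{X,Y} \ge (1 - \alpha) D^I_{X,Y} \left(1 - \alpha \frac{D^I_{X,Y}}{\big(\sqrt{M} + \sqrt{m}\big)^2}\right).$$

The result is obtained by noticing that

$$\frac{D^I_{X,Y}}{\big(\sqrt{M} + \sqrt{m}\big)^2} \le \frac{M}{\big(\sqrt{m} + \sqrt{M}\big)^2} = \frac{1}{\big(1 + \sqrt{m/M}\big)^2} \le
\begin{cases}
\dfrac{1}{\big(1 + \rho^{-1/2}\big)^2} & \text{if } \Theta = [c_1, c_2], \\
1 & \text{if } \Theta = \mathbb{R}_+.
\end{cases}$$

• Complexity term $C^{P,\alpha}$: by using a Taylor expansion with integral remainder, one obtains

$$D^I_{X,Y} - \Delta^{P,\alpha}_{X,Y} = M^{1-\alpha} \big(M^{\alpha} - m^{\alpha}\big) = (M - m) \int_0^1 \frac{\alpha\, M^{1-\alpha}}{\big(m + t (M - m)\big)^{1-\alpha}}\, dt$$

$$= (M - m) \int_0^1 \frac{\alpha}{\big(\frac{m}{M} + t (1 - \frac{m}{M})\big)^{1-\alpha}}\, dt \le D^I_{X,Y} \int_0^1 \frac{\alpha}{\big(\frac{1}{\rho} + t (1 - \frac{1}{\rho})\big)^{1-\alpha}}\, dt = D^I_{X,Y}\, \frac{1 - \rho^{-\alpha}}{1 - \rho^{-1}}.$$

And so,

$$\Delta^{P,\alpha}_{X,Y} \ge D^I_{X,Y} \left(1 - \frac{1 - \rho^{-\alpha}}{1 - \rho^{-1}}\right),$$

which leads to the result.

• Complexity term $C^{D,\alpha}$:

$$\Delta^{D,\alpha}_{X,Y} = \frac{m M}{\alpha M + (1 - \alpha) m} - I_{X,Y} = \frac{\alpha M}{\alpha M + (1 - \alpha) m}\, \Delta^{\min}_{X,Y} + \frac{(1 - \alpha) m}{\alpha M + (1 - \alpha) m}\, D^I_{X,Y}$$

$$\ge \frac{1}{1 + \frac{\alpha}{1 - \alpha} \frac{M}{m}}\, D^I_{X,Y} \ge \frac{1}{1 + \frac{\alpha}{1 - \alpha} \frac{c_2}{c_1}}\, D^I_{X,Y}. \ ■$$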



3.3 Around the triangular inequality's property 

The question now arises whether an IB-divergence or a NIB-divergence satisfies the
property [P6], that is, a triangular inequality. The following proposition establishes sufficient
conditions for such measures to constitute a metric.

Lemma 9

$$H_{X,Y} \le H_{X,Z} + H_{Y,Z} - H_Z \tag{29}$$

$$I_{X,Y} \ge I_{X,Z} + I_{Y,Z} - H_Z \tag{30}$$

Proof. From general properties on entropy, one can obtain

$$H_{X,Y} \le H_{X,Y,Z} = H_{X,Z} + H_{Y|X,Z} \le H_{X,Z} + H_{Y|Z} = H_{X,Z} + H_{Y,Z} - H_Z. \tag{31}$$

Equation (30) derives directly from (29) and (2). ■
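
Inequalities (29) and (30) can be probed on random joint distributions of a hypothetical triple (X, Y, Z). The alphabet sizes, the seed and the tolerance below are arbitrary; the check illustrates the lemma and does not replace its proof.

```python
import itertools
import math
import random

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def h(pmf, axes):
    """Entropy of the sub-vector of (X, Y, Z) selected by axes (X=0, Y=1, Z=2)."""
    acc = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[a] for a in axes)
        acc[key] = acc.get(key, 0.0) + p
    return entropy(acc.values())

def mutual_information(pmf, a, b):
    return h(pmf, (a,)) + h(pmf, (b,)) - h(pmf, (a, b))   # relation (2)

rng = random.Random(1)
outcomes = list(itertools.product(range(2), range(3), range(2)))
lemma_holds = True
for _ in range(200):
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    pmf = {o: w / total for o, w in zip(outcomes, weights)}
    h_z = h(pmf, (2,))
    ineq_29 = h(pmf, (0, 1)) <= h(pmf, (0, 2)) + h(pmf, (1, 2)) - h_z + 1e-12
    ineq_30 = (mutual_information(pmf, 0, 1)
               >= mutual_information(pmf, 0, 2) + mutual_information(pmf, 1, 2) - h_z - 1e-12)
    lemma_holds = lemma_holds and ineq_29 and ineq_30
```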

Proposition 10 Assume the complexity term defining an IB-divergence satisfies the
following property:

$$C_{X,Y} \le C_{X,Z} + C_{Y,Z} - H_Z. \tag{32}$$

Then, the associated IB-divergence satisfies the triangular inequality, that is

$$\Delta_{X,Y} \le \Delta_{X,Z} + \Delta_{Y,Z}. \tag{33}$$

In addition, if $C_{X,Y}$ satisfies

$$C_{X,Y} \ge \max(H_X, H_Y), \tag{34}$$

then the associated NIB-divergence also satisfies a triangular inequality, that is

$$\delta_{X,Y} \le \delta_{X,Z} + \delta_{Y,Z}. \tag{35}$$

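Proof. Since the following quantity

$$\Delta_{X,Z} + \Delta_{Y,Z} - \Delta_{X,Y} = \big(C_{X,Z} + C_{Y,Z} - H_Z - C_{X,Y}\big) + \big(I_{X,Y} - I_{X,Z} - I_{Y,Z} + H_Z\big)$$

is nonnegative from (30) and (32), we have immediately (33). Moreover, the following
equation is valid

Now, it is also easy to see from (34) that



From (36) it follows

$$\delta_{X,Y} \le \frac{C_{X,Z} - I_{X,Z} + C_{Y,Z} - I_{Y,Z}}{\max\big(C_{X,Z}, C_{Y,Z}\big)} \le \frac{C_{X,Z} - I_{X,Z}}{C_{X,Z}} + \frac{C_{Y,Z} - I_{Y,Z}}{C_{Y,Z}} = \delta_{X,Z} + \delta_{Y,Z}.$$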



Remark 4 In Proposition 10, there is no implication between (32) and (34). Indeed,
one may check that the NIB-divergence $\delta^{S,\alpha}$ (with $\alpha = 1/2$ for example) satisfies the
first one but not the second one. Now consider a NIB-divergence with complexity term
$C_{X,Y} = \max(H_X, H_Y) + H_{X|Y} H_{Y|X}$. By choosing X, Y and Z such that $H_{X|Y} =
H_{Y|X} = I_{X,Y} = H_Z/3 = H_{X,Y}/3 > 2$, one asserts that (34) is satisfied but not (32).

Remark 5 Let us consider a NIB-divergence $\delta$ with complexity term $C_{X,Y} = C'_{X,Y} +
\max(H_X, H_Y)$ such that $C'_{X,Y} \ge 0$ (necessarily $C'_{X,Y} = 0$ whenever $X \sim Y$). Then, $\Delta$
and $\delta$ satisfy a triangular inequality if $C'$ also satisfies a triangular inequality. However,
this is not a necessary condition. Indeed, the triangular inequality is not satisfied for the
same example of the previous remark with $C'_{X,Y} = H_{X|Y} H_{Y|X}$, for which $C'_{X,Z} = C'_{Y,Z} = 0$
whereas $C'_{X,Y} > 0$.
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Let us now propose some examples and consequences through the following corollary. 

Corollary 11

(i) The entropy distance and the information distance, that is the IB-divergences with complexity terms $H_{X,Y}$ and $\max(H_X, H_Y)$, satisfy the condition (32) and so are metrics.

(ii) Their normalized versions satisfy the conditions (32) and (34) and so are metrics.

(iii) The measure $\Delta^{S,\alpha}$ for $\alpha \le 1/2$ satisfies the condition (32) and so is a metric.
Moreover, when $\alpha > 1/2$ this measure does not satisfy (32).

(iv) Let $(\alpha^{(j)})_{j=1,\ldots,J}$ be some vector of probability weights for some $J \ge 1$. Let
$\Delta^{(1)}, \ldots, \Delta^{(J)}$ be $J$ IB-divergences (resp. $\delta^{(1)}, \ldots, \delta^{(J)}$ be $J$ NIB-divergences) with complexity
terms $C^{(1)}, \ldots, C^{(J)}$ satisfying (32) (resp. (32) and (34)); then the measures defined
by (27) satisfy a triangular inequality.

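Proof. (i) and (ii) Equation (29) corresponds exactly to (32) for $C_{X,Y} = H_{X,Y}$. And
since $H_{X,Y} \ge \max(H_X, H_Y)$, condition (34) holds as well, which proves that the entropy distance and its normalized version are metrics. Concerning
the information distance and its normalized version, the complexity term corresponds to $C_{X,Y} = \max(H_X, H_Y)$, so that (34) holds trivially. Thus it is
sufficient to prove (32), which is quite obvious. Indeed,

$\max(H_X, H_Z) + \max(H_Y, H_Z) - H_Z \ge \max(H_X, H_Y).$

(iii) Let $m = \min(H_X, H_Y)$ and $M = \max(H_X, H_Y)$. We distinguish three cases:

Case $H_Z \le m$:

$C^{S,\alpha}_{X,Z} + C^{S,\alpha}_{Y,Z} - H_Z = (2\alpha - 1) H_Z + (1 - \alpha)(m + M).$

If $\alpha > 1/2$ and $H_X = H_Y$, the right-hand side of the previous equation equals
$(1 - 2\alpha)(C^{S,\alpha}_{X,Y} - H_Z) + C^{S,\alpha}_{X,Y} < C^{S,\alpha}_{X,Y}$ as soon as $H_Z < H_X$. And so, (32) can never be satisfied for
$\alpha > 1/2$. Now, if $\alpha \le 1/2$, we have

$C^{S,\alpha}_{X,Z} + C^{S,\alpha}_{Y,Z} - H_Z \ge (2\alpha - 1) m + (1 - \alpha)(m + M) = \alpha m + (1 - \alpha) M = C^{S,\alpha}_{X,Y}.$

Case $H_Z \ge M$ (with $\alpha \le 1/2$):

$C^{S,\alpha}_{X,Z} + C^{S,\alpha}_{Y,Z} - H_Z = (1 - 2\alpha) H_Z + \alpha (m + M) \ge \alpha m + (1 - \alpha) M = C^{S,\alpha}_{X,Y}.$

Case $m \le H_Z \le M$:

$C^{S,\alpha}_{X,Z} + C^{S,\alpha}_{Y,Z} - H_Z = \alpha m + (1 - \alpha) M = C^{S,\alpha}_{X,Y}.$

(iv) trivial. ■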
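Since condition (32) for the complexity term $C^{S,\alpha}_{X,Y} = \alpha \min(H_X,H_Y) + (1-\alpha)\max(H_X,H_Y)$ only involves the three entropies $H_X$, $H_Y$, $H_Z$, point (iii) can be explored numerically on a grid of entropy values. The sketch below is illustrative only: the names `c_s`, `slack_32` and `min_slack_on_grid` are ours, and the grid treats the entropies as free nonnegative numbers.

```python
def c_s(alpha, h1, h2):
    """C^{S,alpha} = alpha * min + (1 - alpha) * max of the two entropies."""
    return alpha * min(h1, h2) + (1 - alpha) * max(h1, h2)


def slack_32(alpha, hx, hy, hz):
    """C_{X,Z} + C_{Y,Z} - H_Z - C_{X,Y}; condition (32) asks for a value >= 0."""
    return c_s(alpha, hx, hz) + c_s(alpha, hy, hz) - hz - c_s(alpha, hx, hy)


def min_slack_on_grid(alpha, step=0.25, top=3.0):
    """Smallest slack of (32) over a regular grid of entropy triples."""
    n = int(round(top / step))
    grid = [step * k for k in range(n + 1)]
    return min(
        slack_32(alpha, hx, hy, hz) for hx in grid for hy in grid for hz in grid
    )


print(min_slack_on_grid(0.3))          # nonnegative up to rounding
print(min_slack_on_grid(0.5))          # nonnegative up to rounding
print(slack_32(0.8, 1.0, 1.0, 0.5))    # negative: (32) fails for alpha > 1/2
```

The last call uses $H_X = H_Y = 1$ and $H_Z = 1/2$, for which the slack equals $(1-2\alpha)(H_X - H_Z) = -0.3$ when $\alpha = 0.8$, matching the first case of the proof.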

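We assert that the measures $\Delta^{R,\alpha}$, $\Delta^{P,\alpha}$ and $\Delta^{D,\alpha}$ (and so $\delta^{R,\alpha}$, $\delta^{P,\alpha}$ and $\delta^{D,\alpha}$)
do not satisfy the condition (32). Consider for example $\Delta^{D,\alpha}$. Let us choose $X$, $Y$ and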
Z such that H^ > max {H-^ , H^ ) and such that H^ = -^H^ = -^^H^ ■ This leads to 

cf e + cf e -H^=H^ (—- — ^ — -— + ^^ 



X,Z ' Y,Z Z ^^Z 



aH^ + (1 — Q;)//^ aH^ + (1 — Ci)H^ 



which is in contradiction with (32). 

Concerning these divergences and the measures $\Delta^{S,\alpha}$ (for $\alpha > 1/2$) and $\delta^{S,\alpha}$, we do
not know whether they satisfy a triangular inequality, but our tool cannot be applied to prove
it. We propose to weaken the property [P6] in the following way in order to obtain more
results. An IB-divergence or NIB-divergence satisfies

[P6bis($T$, $c$)] There exists $c \ge 1$ such that for all $(X,Y)$, $(Y,Z)$, $(X,Z) \in T$

$\Delta_{X,Y} \le c \times (\Delta_{X,Z} + \Delta_{Y,Z}).$

Property [P6] is then equivalent to [P6bis($\Gamma^2$, 1)] and we already know from Corollary 11 that the entropy distance, the information distance, their normalized versions
and $\Delta^{S,\alpha}$ (for $\alpha \le 1/2$) satisfy [P6bis($\Gamma^2$, 1)]. When $T \subset \Gamma^2$ the property [P6bis]
is in some sense local, whereas it is global (as a classical triangular inequality) when
$T = \Gamma^2$.

Let us notice that if an IB-divergence (or NIB-divergence) satisfies [P3bis($T$, $k_1$, $k_2$)],
then [P6bis($T$, $k_2/k_1$)] is satisfied since

$\Delta_{X,Y} \le k_2 D_{X,Y} \le k_2 (D_{X,Z} + D_{Y,Z}) \le \frac{k_2}{k_1} (\Delta_{X,Z} + \Delta_{Y,Z}).$



We then inherit a lot of results from Proposition 8 related to our examples. In par- 
ticular A*'° and (5*'° (where • stands for S,R,P and D) both satisfy [P6bis(Te,p-)], 
[P6bis(r|,^)] and [P6bis(r|,^)]. 

In the rest of this section, we attempt to ensure the global property [P6bis(r^,c)]. 
From Proposition 8 (with Q = M^), we assert that the divergences A"^'" (when a > i) 
and (J-^'" (resp. A^'° and 5^'") satisfy [P6bis(r2,j^)] (resp. [P6bis(r2,(j^)]). 
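When $\alpha \le 1/2$, we could improve the previous result on $\Delta^{R,\alpha}$ by proving that it satisfies
[P6bis($\Gamma^2$, $1/(\alpha^2 + (1-\alpha)^2)$)], in the same spirit as the proof leading to [P3bis]. Indeed, with $m = \min(H_X, H_Y)$ and $M = \max(H_X, H_Y)$,

$\Delta^{S,\alpha}_{X,Y} - \Delta^{R,\alpha}_{X,Y} = \alpha(1-\alpha)\left(m + M - 2\sqrt{mM}\right)$

$\le 2\alpha(1-\alpha)\left(\Delta^{S,\alpha}_{X,Y} + I_{X,Y} - \sqrt{mM}\right)$

$\le 2\alpha(1-\alpha)\,\Delta^{S,\alpha}_{X,Y},$

which leads to $\Delta^{R,\alpha}_{X,Y} \ge (\alpha^2 + (1-\alpha)^2)\,\Delta^{S,\alpha}_{X,Y}$. Finally, let us notice that

$\Delta^{R,\alpha}_{X,Y} \le \Delta^{S,\alpha}_{X,Y} \le \Delta^{S,\alpha}_{X,Z} + \Delta^{S,\alpha}_{Y,Z} \le \frac{1}{\alpha^2 + (1-\alpha)^2}\left(\Delta^{R,\alpha}_{X,Z} + \Delta^{R,\alpha}_{Y,Z}\right).$

We now give a further and general result allowing us, in particular, to improve the constant obtained above
for $\Delta^{S,\alpha}$ when $\alpha \ge 1/2$.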

Proposition 12 Let us consider the following assumptions on a complexity term: there
exists a constant $c \ge 1$ such that

$c\,C_{X,Z} + c\,C_{Y,Z} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z}) \ge C_{X,Y}$ (37)

$c\,C_{X,Z} + c\,C_{Y,Z} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z}) \ge \max(C_{X,Y}, C_{X,Z}, C_{Y,Z}).$ (38)

If an IB-divergence satisfies (37) or a NIB-divergence satisfies (38), then they satisfy
[P6bis($\Gamma^2$, $c$)].

Proof. Let us introduce

$A = -(C_{X,Y} - I_{X,Y}) + c \times (C_{X,Z} - I_{X,Z}) + c \times (C_{Y,Z} - I_{Y,Z}).$

From (30) and (37), one may assert that

$A \ge c\,C_{X,Z} + c\,C_{Y,Z} - C_{X,Y} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z}) \ge 0,$

which implies that the result is valid for $\Delta$. Now, from (38) one can write

$A + C_{X,Y} \ge \max(C_{X,Z}, C_{Y,Z}),$

which leads to

$\delta_{X,Y} \le \frac{\Delta_{X,Y} + A}{C_{X,Y} + A} \le \frac{c\,(\Delta_{X,Z} + \Delta_{Y,Z})}{\max(C_{X,Z}, C_{Y,Z})} \le c \times \delta_{X,Z} + c \times \delta_{Y,Z}.$ ■
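Corollary 13

The measures $\Delta^{S,\alpha}$ for $\alpha \ge 1/2$ satisfy [P6bis($\Gamma^2$, $\frac{\alpha}{1-\alpha}$)].

Proof. Let us concentrate on $\Delta^{S,\alpha}$ for $\alpha \ge 1/2$. Let $A = c\,C^{S,\alpha}_{X,Z} + c\,C^{S,\alpha}_{Y,Z} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z})$. Without loss of generality, we assume $H_X \le H_Y$, so that $C^{S,\alpha}_{X,Y} = \alpha H_X + (1-\alpha) H_Y$. We distinguish three
cases, using in each of them that $c\,C - (c-1)\,I \ge C$ for one of the two pairs:

• $H_Z \le H_X \le H_Y$: we have

$A \ge c(1-\alpha) H_X + (1-\alpha) H_Y + (c\alpha + \alpha - 1) H_Z - (c-1) I_{X,Z}.$

Then,

$A - C^{S,\alpha}_{X,Y} \ge (c(1-\alpha) - \alpha) H_X + (c\alpha + \alpha - 1) H_Z - (c-1) I_{X,Z} \ge (c-1)(H_Z - I_{X,Z}) \ge 0,$

as soon as $c \ge \frac{\alpha}{1-\alpha}$.

• $H_X \le H_Y \le H_Z$: we have

$A \ge \alpha H_X + c\alpha H_Y + ((1-\alpha) + c(1-\alpha) - 1) H_Z - (c-1) I_{Y,Z}.$

Then,

$A - C^{S,\alpha}_{X,Y} \ge (c\alpha - (1-\alpha)) H_Y + ((1-\alpha) + c(1-\alpha) - 1) H_Z - (c-1) I_{Y,Z} \ge (c-1)(H_Y - I_{Y,Z}) \ge 0,$

as soon as $c \ge \frac{\alpha}{1-\alpha}$.

• $H_X \le H_Z \le H_Y$: we have

$A \ge c\alpha H_X + (1-\alpha) H_Y + (c(1-\alpha) + \alpha - 1) H_Z - (c-1) I_{X,Z}.$

Then,

$A - C^{S,\alpha}_{X,Y} \ge (c-1)\alpha H_X + (c-1)(1-\alpha) H_Z - (c-1) I_{X,Z} \ge 0.$

Hence, we obtain for $c = \frac{\alpha}{1-\alpha}$, $A - C^{S,\alpha}_{X,Y} \ge 0$. ■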
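The relaxed triangular inequality of Corollary 13 can be probed numerically. The sketch below assumes the complexity term $C^{S,\alpha}_{X,Y} = \alpha\min(H_X,H_Y) + (1-\alpha)\max(H_X,H_Y)$ and takes the constant to be $c = \alpha/(1-\alpha)$, which is our reading of the threshold appearing in the case analysis of the proof. All function names are illustrative.

```python
import math
import random
from itertools import product


def entropy(pmf, coords):
    """Shannon entropy (nats) of the marginal of `pmf` on the coordinates `coords`."""
    marginal = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in coords)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)


def mutual_info(pmf, a, b):
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


def delta_s(pmf, alpha, a, b):
    """IB-divergence with complexity alpha * min(H_A, H_B) + (1 - alpha) * max(H_A, H_B)."""
    ha, hb = entropy(pmf, a), entropy(pmf, b)
    return alpha * min(ha, hb) + (1 - alpha) * max(ha, hb) - mutual_info(pmf, a, b)


def p6bis_slack(pmf, alpha, c):
    """c * (Delta_{X,Z} + Delta_{Y,Z}) - Delta_{X,Y} for a joint pmf on (x, y, z)."""
    X, Y, Z = (0,), (1,), (2,)
    return c * (delta_s(pmf, alpha, X, Z) + delta_s(pmf, alpha, Y, Z)) - delta_s(
        pmf, alpha, X, Y
    )


def random_joint(shape, rng):
    outcomes = list(product(*(range(n) for n in shape)))
    weights = [rng.random() + 1e-3 for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


alpha = 0.7
c = alpha / (1 - alpha)
rng = random.Random(2)
worst = min(
    p6bis_slack(random_joint(tuple(rng.randint(2, 4) for _ in range(3)), rng), alpha, c)
    for _ in range(300)
)
# X and Y independent fair bits, Z constant: the plain triangle inequality (c = 1) fails.
bits = {(x, y, 0): 0.25 for x in (0, 1) for y in (0, 1)}
print(worst >= -1e-9, p6bis_slack(bits, alpha, 1.0) < 0, p6bis_slack(bits, alpha, c) >= 0)
```

The example with two independent fair bits and a constant $Z$ shows why some constant $c > 1$ is needed when $\alpha > 1/2$: there $\Delta_{X,Y} = \log 2$ while $\Delta_{X,Z} + \Delta_{Y,Z} = 2(1-\alpha)\log 2$.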
Remark 6 The tool presented in Proposition 12 cannot be applied to the IB-divergence
$\Delta^{D,\alpha}$ and the NIB-divergence $\delta^{D,\alpha}$. Indeed, let us fix some $c \ge 1$ and let us consider
the quantity

$A = c\,C^{D,\alpha}_{X,Z} + c\,C^{D,\alpha}_{Y,Z} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z}).$

In fact, for every $c \ge 1$ one can find $X$, $Y$, $Z$ such that the quantity $A$ is negative.
Indeed, let us choose $Z$ independent of $X$ and $Y$ and such that $\alpha H_Z + (1-\alpha) H_X = 3c H_X$
and $\alpha H_Z + (1-\alpha) H_Y = 3c H_Y$. Then, it is easy to see that $A = H_Z \left(\frac{1}{3} + \frac{1}{3} - 1\right) < 0$.
In the same manner, the tool is inapplicable to the IB-divergence $\Delta^{P,\alpha}$ and the NIB-divergence
$\delta^{P,\alpha}$. Indeed, let $Z$ be independent of $X$ and $Y$ and such that $H_X =
H_Y = \left(\frac{1}{3c}\right)^{1/\alpha} H_Z$; then

$A = c\,C^{P,\alpha}_{X,Z} + c\,C^{P,\alpha}_{Y,Z} - H_Z - (c-1)(I_{X,Z} + I_{Y,Z}) = -\frac{1}{3} H_Z < 0.$

The following result is an extension of Proposition 12 well-suited to be applied to the NIB-divergence $\delta^{D,\alpha}$.

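Proposition 14 Let us assume that there exist two positive integers $I$ and $J$ such that
a NIB-divergence $\delta$ can be expressed as:

$\delta_{X,Y} = \sum_{i=1}^{I} \frac{S^{(i)}_{X,Y}}{U^{(i)}_{X,Y}} = \sum_{j=1}^{J} \alpha^{(j)} \left(1 - \frac{I_{X,Y}}{C^{(j)}_{X,Y}}\right),$

where $(\alpha^{(j)})_{j=1,\ldots,J}$ is some vector of probability weights. By denoting $S_{X,Y} = \sum_{i=1}^{I} S^{(i)}_{X,Y}$
and $U_{X,Y} = \max_{i=1,\ldots,I} U^{(i)}_{X,Y}$, if there exists some real number $c \ge 1$ such that for any
$j = 1, \ldots, J$ the following assumptions are satisfied:

(i) $A^{(j)} = I_{X,Y} - C^{(j)}_{X,Y} + c\,(S_{X,Z} + S_{Z,Y}) \ge 0$,

(ii) $A^{(j)} + C^{(j)}_{X,Y} \ge \max(U_{X,Z}, U_{Z,Y})$,

then $\delta$ satisfies [P6bis($\Gamma^2$, $c$)].

Proof. Using assumptions (i) and (ii), one can prove that for all $j = 1, \ldots, J$

$1 - \frac{I_{X,Y}}{C^{(j)}_{X,Y}} \le 1 - \frac{I_{X,Y}}{C^{(j)}_{X,Y} + A^{(j)}} \le c\,\frac{S_{X,Z} + S_{Y,Z}}{\max(U_{X,Z}, U_{Z,Y})}.$

It follows that

$\delta_{X,Y} = \sum_{j=1}^{J} \alpha^{(j)} \left(1 - \frac{I_{X,Y}}{C^{(j)}_{X,Y}}\right) \le c\,\frac{S_{X,Z} + S_{Y,Z}}{\max(U_{X,Z}, U_{Z,Y})} \le c \sum_{i=1}^{I} \left(\frac{S^{(i)}_{X,Z}}{U^{(i)}_{X,Z}} + \frac{S^{(i)}_{Y,Z}}{U^{(i)}_{Y,Z}}\right) = c\,(\delta_{X,Z} + \delta_{Y,Z}).$ ■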



Corollary 15 The measure $\delta^{D,\alpha}$ satisfies [P6bis($\Gamma^2$, $1/\alpha^{\wedge}$)], where $\alpha^{\wedge} = \min(\alpha, 1-\alpha)$.

Proof. We have

$\delta^{D,\alpha}_{X,Y} = \alpha\,\frac{\min(H_{X|Y}, H_{Y|X})}{\min(H_X, H_Y)} + (1-\alpha)\,\frac{\max(H_{X|Y}, H_{Y|X})}{\max(H_X, H_Y)}$

$= \alpha \left(1 - \frac{I_{X,Y}}{\min(H_X, H_Y)}\right) + (1-\alpha) \left(1 - \frac{I_{X,Y}}{\max(H_X, H_Y)}\right).$

By identification with the notation introduced in Proposition 14, we have $I = J = 2$, $S^{(1)}_{X,Y} =
\alpha \min(H_{X|Y}, H_{Y|X})$, $S^{(2)}_{X,Y} = (1-\alpha) \max(H_{X|Y}, H_{Y|X})$, $U^{(1)}_{X,Y} = \min(H_X, H_Y)$,
$U^{(2)}_{X,Y} = \max(H_X, H_Y)$, $C^{(1)}_{X,Y} = \min(H_X, H_Y)$ and $C^{(2)}_{X,Y} = \max(H_X, H_Y)$. Let us
fix $c$ to the value $1/\alpha^{\wedge}$. We have

$A^{(1)} = I_{X,Y} - \min(H_X, H_Y) + \frac{1}{\alpha^{\wedge}} \Big(\alpha \min(H_{X|Z}, H_{Z|X}) + (1-\alpha) \max(H_{X|Z}, H_{Z|X}) + \alpha \min(H_{Y|Z}, H_{Z|Y}) + (1-\alpha) \max(H_{Y|Z}, H_{Z|Y})\Big).$

Clearly from (29)

$A^{(1)} \ge \max(H_X, H_Y) - H_{X,Y} + 2H_{X,Z} + 2H_{Y,Z} - H_X - H_Y - 2H_Z$
$\ge H_{X,Z} + H_{Y,Z} - \min(H_X, H_Y) - H_Z \ge 0.$

And one also has

$\min(H_X, H_Y) + A^{(1)} \ge H_{X,Z} + H_{Y,Z} - H_Z \ge \max(H_X, H_Y, H_Z) = \max(U_{X,Z}, U_{Y,Z}).$

It follows that $A^{(1)}$ fulfills conditions (i) and (ii) of Proposition 14 with $c = 1/\alpha^{\wedge}$. The
proof is strictly similar for $A^{(2)}$. ■
4 Prediction framework

We now focus on properties related to the prediction of some fixed random vector $Y$.

4.1 Prediction framework 

Recall that our purpose is to find the random vector $X$ that minimizes $\Delta_{Y,X}$ (resp. $\delta_{Y,X}$),
which combines a complexity term $C_{Y,X}$ (to minimize) and an information term $I_{Y,X}$
(to maximize). Let us imagine that we already have some $X_1$ and its associated measure
$\Delta_{Y,X_1}$ (resp. $\delta_{Y,X_1}$). After evaluating $\Delta_{Y,X_2}$ (resp. $\delta_{Y,X_2}$), we may be interested in
describing the conditions under which $X_2$ is better or worse than $X_1$:

Proposition 16 Two situations may occur.

Case 1: we choose $X_2$ instead of $X_1$ when

$\Delta_{Y,X_2} < \Delta_{Y,X_1} \iff C_{Y,X_2} - C_{Y,X_1} < I_{Y,X_2} - I_{Y,X_1}$ (39)

$\delta_{Y,X_2} < \delta_{Y,X_1} \iff \frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} < \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}}$ (40)

Case 2: we keep $X_1$ and reject $X_2$ when

$\Delta_{Y,X_2} \ge \Delta_{Y,X_1} \iff C_{Y,X_2} - C_{Y,X_1} \ge I_{Y,X_2} - I_{Y,X_1}$ (41)

$\delta_{Y,X_2} \ge \delta_{Y,X_1} \iff \frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} \ge \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}}$ (42)

This result implies automatically that the properties [P8] and [P9] are satisfied. Let
us comment more precisely on the previous proposition:

• Case 1 holds when

1. $X_2$ is simpler than $X_1$ (i.e. $C_{Y,X_2} - C_{Y,X_1} < 0$) and $X_2$ is at least as
informative as $X_1$ (i.e. $I_{Y,X_2} - I_{Y,X_1} \ge 0$).

2. $X_2$ and $X_1$ have the same complexity (i.e. $C_{Y,X_2} - C_{Y,X_1} = 0$) and $X_2$ is
more informative than $X_1$ (i.e. $I_{Y,X_2} - I_{Y,X_1} > 0$).

3. $X_2$ is simpler and less informative than $X_1$ and such that the absolute (resp.
relative) excess of complexity is lower than the absolute (resp. relative) gain of
information, that is $C_{Y,X_2} - C_{Y,X_1} < I_{Y,X_2} - I_{Y,X_1} < 0$ (resp. $\frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} < \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}} < 0$).

4. $X_2$ is more complex and more informative than $X_1$ and such that the absolute
(resp. relative) excess of complexity is lower than the absolute (resp. relative)
gain of information, that is $0 < C_{Y,X_2} - C_{Y,X_1} < I_{Y,X_2} - I_{Y,X_1}$ (resp. $0 < \frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} < \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}}$).

• Case 2 holds when

1. $X_2$ is at least as complex as $X_1$ (i.e. $C_{Y,X_2} - C_{Y,X_1} \ge 0$) and $X_2$ is at most
as informative as $X_1$ (i.e. $I_{Y,X_2} - I_{Y,X_1} \le 0$).

2. $X_2$ is simpler and less informative than $X_1$, and such that the absolute (resp.
relative) excess of complexity is greater than or equal to the absolute (resp.
relative) gain of information, that is $0 > C_{Y,X_2} - C_{Y,X_1} \ge I_{Y,X_2} - I_{Y,X_1}$ (resp.
$0 > \frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} \ge \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}}$).

3. $X_2$ is more complex and more informative than $X_1$, and such that the absolute
(resp. relative) excess of complexity is greater than or equal to the absolute
(resp. relative) gain of information, that is $C_{Y,X_2} - C_{Y,X_1} \ge I_{Y,X_2} - I_{Y,X_1} > 0$
(resp. $\frac{C_{Y,X_2} - C_{Y,X_1}}{C_{Y,X_1}} \ge \frac{I_{Y,X_2} - I_{Y,X_1}}{I_{Y,X_1}} > 0$).
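The decision rule of Proposition 16 can be written as a small function. The sketch below is only illustrative: the `Candidate` container and the function names are ours, and it assumes that the complexity and information terms of each candidate have already been computed, with $C_{Y,X} > 0$ and $I_{Y,X_1} > 0$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    complexity: float   # C_{Y,X}, assumed > 0
    information: float  # I_{Y,X}, with 0 <= I_{Y,X} <= C_{Y,X}


def ib(c):
    """IB-divergence Delta_{Y,X} = C_{Y,X} - I_{Y,X}."""
    return c.complexity - c.information


def nib(c):
    """NIB-divergence delta_{Y,X} = 1 - I_{Y,X} / C_{Y,X}."""
    return 1 - c.information / c.complexity


def prefer_x2(x1, x2, normalized=False):
    """Case 1 of Proposition 16: True when X2 should replace X1."""
    excess_complexity = x2.complexity - x1.complexity
    gain_information = x2.information - x1.information
    if normalized:
        # Relative comparison (40); requires I_{Y,X1} > 0.
        return excess_complexity / x1.complexity < gain_information / x1.information
    # Absolute comparison (39).
    return excess_complexity < gain_information


x1 = Candidate(complexity=2.0, information=0.5)
x2 = Candidate(complexity=2.5, information=1.2)  # more complex, much more informative
x3 = Candidate(complexity=4.0, information=0.8)  # more complex, slightly more informative
x4 = Candidate(complexity=3.0, information=1.0)  # absolute and relative rules disagree

for cand in (x2, x3, x4):
    print(prefer_x2(x1, cand), prefer_x2(x1, cand, normalized=True))
```

The candidate `x4` shows that the IB and NIB rules need not agree: its absolute excess of complexity (1.0) exceeds its absolute gain of information (0.5), whereas its relative excess (0.5) is smaller than its relative gain (1.0).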

Proposition 17 Any complexity term $C^{\alpha}$ of the form (17) satisfies [P10].

Proof. Without loss of generality the function $g(\cdot)$ defining $C^{\alpha}$ is assumed to be an
increasing function. Hence, $H_{X_2} \ge H_{X_1}$ implies that $C^{\alpha}_{Y,X_2} \ge C^{\alpha}_{Y,X_1}$. Now, let us assume
$C^{\alpha}_{Y,X_2} \ge C^{\alpha}_{Y,X_1}$. We assert, by denoting $m_i = \min(H_Y, H_{X_i})$ and $M_i = \max(H_Y, H_{X_i})$
for $i = 1, 2$,

$C^{\alpha}_{Y,X_2} \ge C^{\alpha}_{Y,X_1} \iff g^{-1}\big(\alpha g(m_1) + (1-\alpha) g(M_1)\big) \le g^{-1}\big(\alpha g(m_2) + (1-\alpha) g(M_2)\big)$
$\iff \alpha \big(g(m_1) - g(m_2)\big) + (1-\alpha) \big(g(M_1) - g(M_2)\big) \le 0.$

Now, assume moreover that $H_{X_1} > H_{X_2}$; then the left-hand side of the last inequality is

$= (1-\alpha)\big(g(H_{X_1}) - g(H_{X_2})\big) > 0$ if $H_Y \le H_{X_2} < H_{X_1}$,
$= \alpha\big(g(H_Y) - g(H_{X_2})\big) + (1-\alpha)\big(g(H_{X_1}) - g(H_Y)\big) > 0$ if $H_{X_2} < H_Y < H_{X_1}$,
$= \alpha\big(g(H_{X_1}) - g(H_{X_2})\big) > 0$ if $H_{X_2} < H_{X_1} \le H_Y$.

This leads to a contradiction which implies that $H_{X_2} \ge H_{X_1}$. ■

Remark 7 The complexity terms C^ and C^ do not satisfy the property [PIO] in 
the general case. Indeed, there is no implication for C and one can only prove that 
H^ > H^ =^ C£ ^ > C£ ^ . However, one can point out that when /„ „ = I^ y 

Xi — A2 Y,Xi — Y,X2 ' ^ Y,Xi Y ,X2 

then both C^ and C^ satisfy [PIO]. 

More specifically, two frameworks may be of special interest:

• $X_2$ is as informative as $X_1$ (i.e. $I_{Y,X_1} = I_{Y,X_2}$): we expect to select the random
variable with the smallest entropy. This is effectively what happens when [P10] is satisfied,
which is the case, from Proposition 17 and Remark 7, for
$C^{\bullet}$ with $\bullet = I, S, R, P, D$ in the general case and for the complexity terms of Remark 7 in this framework, since then
$C_{Y,X_2} \le C_{Y,X_1} \iff H_{X_2} \le H_{X_1}$.

• $X_1 = g(X_2)$ with $g$ some surjective (but not injective) mapping: $X_2$ is more
complex than $X_1$ and $X_2$ is at least as informative as $X_1$. Consequently, this case
is not trivial since both absolute (resp. relative) excess of complexity and absolute
(resp. relative) gain of information are competing. Let us give two important
examples of such a context.

1. quantization problem: given a quantized version $X_1$ of some (continuous)
random variable with its associated partition $\mathcal{A}_1$, the problem is to know
whether some new quantized version $X_2$ with an associated partition $\mathcal{A}_2$ finer
than $\mathcal{A}_1$ should be preferred to predict $Y$.

2. variables selection problem: suppose one wants to construct an ascending
selection method. The vector $\mathbf{X}_1$ could represent some selected set of covariables
and $\mathbf{X}_2 = (\mathbf{X}_1, X_2)$ a larger set of covariables. The aim is then to know
whether $X_2$ should be added to the selected set or not.

Some simple algorithms of quantization and selection methods are proposed in Robineau
(2004) using these results.
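To make the variables selection setting concrete, here is a minimal sketch of an ascending (forward) selection driven by a NIB-divergence. It is not the algorithm of Robineau (2004): it is our own illustration, assuming plug-in (empirical) entropies, the complexity term $\max(H_Y, H_X)$, and a greedy rule that adds a covariable only when it strictly decreases $\delta_{Y,X}$. The data set and all names are hypothetical.

```python
import math
from collections import Counter


def entropy(rows, cols):
    """Plug-in Shannon entropy (nats) of the columns `cols` of a list of rows."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum(k / n * math.log(k / n) for k in counts.values())


def nib(rows, y, xs):
    """delta_{Y,X} = 1 - I_{Y,X} / max(H_Y, H_X); set to 1 for an empty X."""
    if not xs:
        return 1.0
    hy, hx = entropy(rows, (y,)), entropy(rows, xs)
    c = max(hy, hx)
    if c == 0:
        return 0.0
    return 1 - (hy + hx - entropy(rows, (y,) + xs)) / c


def forward_select(rows, y, candidates, tol=1e-12):
    """Greedy ascending selection: add the best covariable while delta decreases."""
    selected, best = (), 1.0
    improved = True
    while improved:
        improved, pick = False, None
        for c in candidates:
            if c in selected:
                continue
            score = nib(rows, y, selected + (c,))
            if score < best - tol:
                best, pick, improved = score, c, True
        if improved:
            selected += (pick,)
    return selected, best


# Toy data: columns 0, 1, 2 are independent fair bits, column 3 is y = 2*x0 + x1.
rows = [(a, b, c, 2 * a + b) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
selected, best = forward_select(rows, y=3, candidates=(0, 1, 2))
print(selected, best)
```

On this toy data the procedure keeps the two covariables that determine $Y$ and rejects the irrelevant one, because adding it would raise the complexity term without any gain of information.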
4.2 Around the redundancy of two random vectors $X_1$ and $X_2$

In the future use of an IB-divergence or NIB-divergence, one would expect that if two
discrete-valued random vectors $X_1$ and $X_2$ carry the same (or almost the same) information
with respect to an IB-divergence or NIB-divergence, then both have the same
effect on the prediction of another vector $Y$. This requirement, expressed by the property
[P11], could be used for example in a variables selection problem in the context of
discrimination to detect redundant variables.

In order to make the property [P11] applicable for practical purposes, it is
of interest to have a bound on the difference $|\Delta_{Y,X_1} - \Delta_{Y,X_2}|$ (resp. $|\delta_{Y,X_1} - \delta_{Y,X_2}|$)
expressed in terms of $D_{X_1,X_2}$ (resp. $d_{X_1,X_2}$). More precisely, the question may arise
whether there exists a function $h(\cdot)$ satisfying $h(x) \to 0$ as $x \to 0$ and such that
$|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le h(D_{X_1,X_2})$ (resp. $|\delta_{Y,X_1} - \delta_{Y,X_2}| \le h(d_{X_1,X_2})$). Hence, according
to our examples, we concentrate only on linear functions $h(\cdot)$.

We then propose to translate the property [P11] on an IB-divergence $\Delta$ (resp. a
NIB-divergence $\delta$) by:

[P11bis($T$, $k$)] there exists some positive constant $k$ such that for all $(X_1, X_2) \in T \subset \Gamma^2$,

$|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le k\,D_{X_1,X_2}.$ (43)

As a first answer, let us point out that if the IB-divergence (resp. NIB-divergence)
satisfies a triangular inequality [P6bis($\Gamma^2$, 1)] and [P3bis($T$, $k_1$, $k_2$)], then it satisfies
[P11bis($T$, $k_2$)] due to the equivalent expression of the triangular inequality as

$|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le \Delta_{X_1,X_2}$ (resp. $|\delta_{Y,X_1} - \delta_{Y,X_2}| \le \delta_{X_1,X_2}$).

A priori, if an IB-divergence or NIB-divergence only satisfies [P6bis($\Gamma^2$, $c$)] with
some $c > 1$, then this property no longer seems to hold: indeed, for all $Y$, $X_1$ and
$X_2$, one may prove for an IB-divergence for instance that

$|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le c \times \Delta_{X_1,X_2} + (c-1) \min(\Delta_{Y,X_1}, \Delta_{Y,X_2}) \nleq c \times \Delta_{X_1,X_2}.$

Actually, this apparently disappointing result only expresses that a "redundancy" property
cannot (always) be derived from a triangular-type inequality.






The following proposition gives some sufficient conditions on a complexity
term ensuring that the associated $\Delta$ and $\delta$ satisfy the property [P11bis].

Proposition 18 (i) Assume there exists some positive constant $k_1$ such that the complexity
term of an IB-divergence satisfies for all $(X_1, X_2) \in T$

$|C_{Y,X_1} - C_{Y,X_2}| \le k_1 |H_{X_1} - H_{X_2}|;$ (44)

then $\Delta$ satisfies [P11bis($T$, $1 + k_1$)].

(ii) If in addition, there exists some positive constant $k_2$ such that for all $(X_1, X_2) \in T$

$\max(C_{Y,X_1}, C_{Y,X_2}) \ge k_2 \times C_{X_1,X_2},$ (45)

then the associated NIB-divergence satisfies [P11bis($T$, $\frac{1 + k_1}{k_2}$)].

Proof. (i) Let us start by writing

$|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le |C_{Y,X_1} - C_{Y,X_2}| + |I_{Y,X_1} - I_{Y,X_2}|.$ (46)

Now, notice that

$I_{Y,X_1} \le I_{Y,X_2} + H_{X_1,X_2} - H_{X_2},$

from which one can deduce

$|I_{Y,X_1} - I_{Y,X_2}| \le \max(H_{X_1}, H_{X_2}) - I_{X_1,X_2} = \max(H_{X_1|X_2}, H_{X_2|X_1}) = D_{X_1,X_2}.$ (47)

The result is then obtained by combining (44), (46) and (47), together with $|H_{X_1} - H_{X_2}| = |H_{X_1|X_2} - H_{X_2|X_1}| \le D_{X_1,X_2}$.

(ii) We can obtain the following result

$|\delta_{Y,X_1} - \delta_{Y,X_2}| = \left|\frac{I_{Y,X_1}}{C_{Y,X_1}} - \frac{I_{Y,X_2}}{C_{Y,X_2}}\right| \le \frac{|I_{Y,X_1} - I_{Y,X_2}| + |C_{Y,X_1} - C_{Y,X_2}|}{\max(C_{Y,X_1}, C_{Y,X_2})}.$

The result then comes from (44), (45) and (47). ■
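As an illustration of Proposition 18 (i), the sketch below checks numerically that $|\Delta_{Y,X_1} - \Delta_{Y,X_2}| \le (1 + k_1) D_{X_1,X_2}$ for the complexity term $\alpha\min(H_Y,H_X) + (1-\alpha)\max(H_Y,H_X)$. It assumes the value $k_1 = \max(\alpha, 1-\alpha)$ in (44), which follows from an elementary computation for this complexity term; the function names are illustrative.

```python
import math
import random
from itertools import product


def entropy(pmf, coords):
    """Shannon entropy (nats) of the marginal of `pmf` on the coordinates `coords`."""
    marginal = {}
    for outcome, prob in pmf.items():
        key = tuple(outcome[i] for i in coords)
        marginal[key] = marginal.get(key, 0.0) + prob
    return -sum(q * math.log(q) for q in marginal.values() if q > 0)


def mutual_info(pmf, a, b):
    return entropy(pmf, a) + entropy(pmf, b) - entropy(pmf, a + b)


def random_joint(shape, rng):
    outcomes = list(product(*(range(n) for n in shape)))
    weights = [rng.random() + 1e-3 for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


def redundancy_slack(pmf, alpha):
    """(1 + k1) * D_{X1,X2} - |Delta_{Y,X1} - Delta_{Y,X2}| for a pmf on (y, x1, x2)."""
    Y, X1, X2 = (0,), (1,), (2,)
    hy, h1, h2 = entropy(pmf, Y), entropy(pmf, X1), entropy(pmf, X2)

    def delta(hx, x):
        return alpha * min(hy, hx) + (1 - alpha) * max(hy, hx) - mutual_info(pmf, Y, x)

    info_distance = max(h1, h2) - mutual_info(pmf, X1, X2)  # D_{X1,X2}, see (47)
    k1 = max(alpha, 1 - alpha)                              # assumed constant in (44)
    return (1 + k1) * info_distance - abs(delta(h1, X1) - delta(h2, X2))


rng = random.Random(3)
worst = min(
    redundancy_slack(random_joint(tuple(rng.randint(2, 4) for _ in range(3)), rng), 0.3)
    for _ in range(300)
)
print(worst >= -1e-9)
```

When $X_1$ and $X_2$ are nearly redundant, $D_{X_1,X_2}$ is close to zero and the bound forces the two divergences to $Y$ to be close as well.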

Let us apply the previous result to our different examples: 
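Corollary 19 Let $X_1, X_2 \in T_\Theta$ with $\Theta = [c_1, c_2]$ and define $\gamma_i$ ($i = 1, 2$) such that
$c_i = \gamma_i H_Y$; then

$|\Delta^{\bullet,\alpha}_{Y,X_1} - \Delta^{\bullet,\alpha}_{Y,X_2}| \le (1 + \kappa^{\bullet}_{1,\Theta})\, D_{X_1,X_2}$

and

$|\delta^{\bullet,\alpha}_{Y,X_1} - \delta^{\bullet,\alpha}_{Y,X_2}| \le \frac{1 + \kappa^{\bullet}_{1,\Theta}}{\kappa^{\bullet}_{2,\Theta}}\, d_{X_1,X_2},$ (48)

where $\bullet$ stands for $S$, $R$, $P$ and $D$, and where the different constants are expressed by

$\begin{array}{c|c|c}
\bullet & \kappa^{\bullet}_{1,\Theta} & \kappa^{\bullet}_{2,\Theta} \\ \hline
S & \alpha^{\vee} & (1-\alpha) + \alpha\gamma_{1,2} \\
R & (\alpha^{\vee})^2 + \dfrac{\alpha(1-\alpha)}{\sqrt{\gamma_1}} & \left((1-\alpha) + \alpha\sqrt{\gamma_{1,2}}\right)^2 \\
P & \max\left(\dfrac{1-\alpha}{\gamma_1^{\alpha}}, \dfrac{\alpha}{\gamma_1^{1-\alpha}}, \mathbf{1}_{]0,1]}(\gamma_1)\right) & \gamma_{1,2}^{\alpha} \\
D & \dfrac{\alpha^{\vee}}{(\alpha^{\wedge})^2} \dfrac{1}{(1+\gamma_{1,2})^2} & \left(\dfrac{\alpha}{\gamma_{1,2}} + (1-\alpha)\right)^{-1}
\end{array}$

with $\gamma_{1,2} = \min(\gamma_1, 1/\gamma_2)$.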



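Proof. For the sake of simplicity, let us denote $m_i = \min(H_Y, H_{X_i})$ (resp. $m =
\min(H_Y, H_X)$) and $M_i = \max(H_Y, H_{X_i})$ for $i = 1, 2$ (resp. $M = \max(H_Y, H_X)$).
Let us notice on the one hand that $|M_1 + m_1 - (M_2 + m_2)| = |H_{X_1} - H_{X_2}|$ and on the
other hand that

$m \ge \min(1, \gamma_1, 1/\gamma_2)\, M = \gamma_{1,2} M.$

Complexity term $C^{S,\alpha}$: we have

$|C^{S,\alpha}_{Y,X_1} - C^{S,\alpha}_{Y,X_2}| = |\alpha m_1 + (1-\alpha) M_1 - \alpha m_2 - (1-\alpha) M_2|$
$= |\alpha (m_1 - m_2) + (1-\alpha)(M_1 - M_2)|$
$\le \alpha^{\vee} |H_{X_1} - H_{X_2}|.$

Moreover,

$C^{S,\alpha}_{Y,X} = \alpha m + (1-\alpha) M \ge ((1-\alpha) + \alpha\gamma_{1,2})\, M.$

Complexity term $C^{R,\alpha}$: we have

$|C^{R,\alpha}_{Y,X_1} - C^{R,\alpha}_{Y,X_2}| = \left|\alpha^2 (m_1 - m_2) + (1-\alpha)^2 (M_1 - M_2) + 2\alpha(1-\alpha)\sqrt{H_Y}\left(\sqrt{H_{X_1}} - \sqrt{H_{X_2}}\right)\right|.$

Furthermore, we may obtain

$|\alpha^2 (m_1 - m_2) + (1-\alpha)^2 (M_1 - M_2)| \le (\alpha^{\vee})^2 |H_{X_1} - H_{X_2}|$

and

$\sqrt{H_Y}\left|\sqrt{H_{X_1}} - \sqrt{H_{X_2}}\right| = \frac{\sqrt{H_Y}}{\sqrt{H_{X_1}} + \sqrt{H_{X_2}}}\, |H_{X_1} - H_{X_2}| \le \frac{\sqrt{H_Y}}{2\sqrt{\min(H_{X_1}, H_{X_2})}}\, |H_{X_1} - H_{X_2}| \le \frac{1}{2\sqrt{\gamma_1}}\, |H_{X_1} - H_{X_2}|.$

Hence,

$|C^{R,\alpha}_{Y,X_1} - C^{R,\alpha}_{Y,X_2}| \le \left((\alpha^{\vee})^2 + \frac{\alpha(1-\alpha)}{\sqrt{\gamma_1}}\right) |H_{X_1} - H_{X_2}|.$

Moreover, one can prove

$C^{R,\alpha}_{Y,X} = \left(\alpha\sqrt{m} + (1-\alpha)\sqrt{M}\right)^2 \ge \left((1-\alpha) + \alpha\sqrt{\gamma_{1,2}}\right)^2 M.$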



CP : 



Complexity term 



: we have (by assuming H^ 



= < 



i7° [h]^'^ - Hi-"^^ 



i^i-" (f° - i:fe 



>^. 



if iJy < min [H^^ , iJ^^ 



Note that the third case cannot occur if 



fjafjl-a _ ^a fjl-a otherwise. 

5^X2 Xi Y 



71 >1- 



f^P,a f^P,oi 



Y,X2^ — 



^ (^X2 - ^x, ) if ^v < min (i7,^ , i7. 
^ if i7y > max ( H-j^ , i7 

otherwise 



_a / TT _ 



7i 

-fr„ — H ^ 






< max 



Moreover, we may obtain 



'■2 -^1 



C^'° = m"M^-" > 7f_2^ 
Complexity term $C^{D,\alpha}$: we have



' ^'^1 ^'^2' {aMi + {1 - a)mi){aM2 + {I - a)m2) 



Finally, we also have 



a 



D,a 



< 



a 



V 



M1M2 



{a^y {mi + Mi){m2 + M2) 



H^ — H^ 



< 



a 



1 



(aA)2 {l+j,^^y 



H^_ — H^ 



a 1 — Q 

m M 



-1 / \ -1 

> ( — + (1 - a) ) M. 

V7l,2 



(49) 
(50) 
(51) 



Remark 8 Note that when $\alpha \le 1/2$, the measure $\Delta^{S,\alpha}$ is a metric and so we derive (48)
directly from [P3bis].